This is assignment is due by 5pm, Tuesday 26th May.
All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.
Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.
We strongly advise working in the folder ~/transcriptomics/assignment4
on your virtual machine. Using an R Project for each individual assignment is also strongly advised.
If all files required for submission are contained on your VM:
.zip
If all files are on your on your local Windows machine:
Send to > Compressed (zipped) folder
.zip
If all the files are on your *local macOS machine`:
For microarray data we are able to assume normality, whilst for RNA-seq we are not. Describe briefly why this is and what types of distributions and models we are able to use instead.
Normalisation in RNA-seq can be performed using offsets in the model fitting stages, with two of the most common approaches being TMM and CQN. Briefly describe which technical issues the two methods are attempting to overcome. Whilst there is much mathematical detail in these papers which is beyond the scope of this course, the papers describing the two approaches are available here for TMM and here for CQN. Feel free to use these papers whilst formulating your answer.
source("https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/A4Funs.R")
makeGsTestData("a1234567")
For this question, please run the above code to generate your test dataset, using your own student ID instead of the one provided. This will place an object called gsTestData
in your R Environment. Once you have this object, please perform the following tasks
pander()
. Missing column names in the first printed column are expectedfisher.test()
and provide the output in your RMarkdownlibrary(tidyverse)
library(pander)
myMat <- gsTestData %>%
pivot_wider(
id_cols = inGeneSet,
names_from = DE,
values_from = Total,
names_prefix = "DE_"
) %>%
as.data.frame() %>%
column_to_rownames("inGeneSet") %>%
as.matrix()
myMat
## DE_TRUE DE_FALSE
## TRUE 73 435
## FALSE 517 3056
pander(myMat)
DE_TRUE | DE_FALSE | |
---|---|---|
TRUE | 73 | 435 |
FALSE | 517 | 3056 |
fisher.test(myMat)
##
## Fisher's Exact Test for Count Data
##
## data: myMat
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.7502549 1.2977600
## sample estimates:
## odds ratio
## 0.9919664
The above tests the hypothesis:
\(H_0\): There is no association between DE status and the Gene Set, vs
\(H_A\): There is an association between DE status and the Gene Set
As \(p \approx 1\) we would accept the null hypothesis and conclude that there is no association between this gene set and our set of DE genes.
For this question you will need the dataset provided for you here, which is a comparison of splenic and pancreatic Tregs from Mus musculus.
topTable
.logFC
or simply choose the top n
-ranked genes by p-value. Explain your reasoning.goseq
package. If any discernible pattern is present choose one of these as the bias you are choosing to offset and explain why.goseq()
choosing one of the collections available in the msigdbr
package. Justify any filtering of gene-sets that you undertake, or why you have chosen not to. A list of possible collections is provided on the help page for msigdbr
. Please note that is using the categories C2, C3 or C5 you will be expected to choose only one of the subcategories. A full list with complete descriptions is available heren
-ranked genes within each geneset as a separate column. Summarise your results very briefly in plain text.If working with others in a group, please ensure that you choose different gene-set collections for your submission.
library(goseq)
library(msigdbr)
library(magrittr)
topTable <- "https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/data/topTable.csv" %>%
url() %>%
read_csv()
head(topTable)
## # A tibble: 6 x 9
## gene_id gene_name logFC logCPM PValue FDR entrezid aveLen aveGc
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSMUSG0000002… Cela1 7.16 9.02 3.17e-69 3.36e-65 109901 639. 54.0
## 2 ENSMUSG0000003… Tent5a 1.30 7.06 1.08e-64 5.74e-61 212943 5564. 43.2
## 3 ENSMUSG0000004… Ctla2a 2.67 5.06 6.42e-55 2.27e-51 13024 792. 45.1
## 4 ENSMUSG0000002… Fosl2 1.86 5.12 2.06e-52 5.47e-49 14284 3905. 53.6
## 5 ENSMUSG0000002… Ttc39c 1.22 5.15 1.23e-34 2.61e-31 72747 1977. 49.9
## 6 ENSMUSG0000004… Camk2n1 1.98 4.86 1.20e-31 2.13e-28 66259 4444 53.7
minLFC <- 0.5
fdr <- 0.01
In order to decide on which genes were DE, I first decided to make a volcano plot. After a few experiments, I decided on an FDR of 0.01 and a logFC filter of 0.5 for considering a gene as DE. This seemed a fair compromise between accepting genes with a small amount of differential expression and being statistically conservative.
topTable$DE <- topTable$FDR < fdr & abs(topTable$logFC) > minLFC
topTable %>%
ggplot(aes(logFC, -log10(PValue), colour = DE)) +
geom_point() +
geom_vline(xintercept = c(-1, 1)* minLFC, linetype = 2) +
scale_colour_manual(values = c("grey", "red")) +
scale_x_continuous(breaks = seq(-4, 8, by = 2)) +
theme_bw()
This gave a total of 639 DE genes.
deVec <- setNames(topTable$DE, topTable$gene_id)
head(deVec)
## ENSMUSG00000023031 ENSMUSG00000032265 ENSMUSG00000044258 ENSMUSG00000029135
## TRUE TRUE TRUE TRUE
## ENSMUSG00000024424 ENSMUSG00000046447
## TRUE TRUE
pwfLen <- nullp(deVec, bias.data = topTable$aveLen, plot.fit = FALSE)
pwfGC <- nullp(deVec, bias.data = topTable$aveGc, plot.fit = FALSE)
par(mfrow = c(1, 2))
plotPWF(pwfLen, xlab = "Gene Length")
plotPWF(pwfGC, xlab = "GC Content")
par(mfrow = c(1, 1))
For enrichment analysis, I’m going to choose the KEGG gene-sets
kg <- msigdbr(species = "Mus musculus", category = "C2", subcategory = "CP:KEGG") %>%
inner_join(
dplyr::select(topTable, gene_id, entrez_gene = entrezid)
)
kgByGene <- kg %>%
dplyr::distinct(gs_name, gene_id) %>%
split(f = .$gene_id) %>%
lapply(extract2, "gs_name")
head(kgByGene)
## $ENSMUSG00000000001
## [1] "KEGG_AXON_GUIDANCE"
## [2] "KEGG_CHEMOKINE_SIGNALING_PATHWAY"
## [3] "KEGG_GAP_JUNCTION"
## [4] "KEGG_LEUKOCYTE_TRANSENDOTHELIAL_MIGRATION"
## [5] "KEGG_LONG_TERM_DEPRESSION"
## [6] "KEGG_MELANOGENESIS"
## [7] "KEGG_PROGESTERONE_MEDIATED_OOCYTE_MATURATION"
## [8] "KEGG_TIGHT_JUNCTION"
##
## $ENSMUSG00000000028
## [1] "KEGG_CELL_CYCLE"
##
## $ENSMUSG00000000088
## [1] "KEGG_ALZHEIMERS_DISEASE" "KEGG_CARDIAC_MUSCLE_CONTRACTION"
## [3] "KEGG_HUNTINGTONS_DISEASE" "KEGG_OXIDATIVE_PHOSPHORYLATION"
## [5] "KEGG_PARKINSONS_DISEASE"
##
## $ENSMUSG00000000142
## [1] "KEGG_BASAL_CELL_CARCINOMA" "KEGG_COLORECTAL_CANCER"
## [3] "KEGG_ENDOMETRIAL_CANCER" "KEGG_PATHWAYS_IN_CANCER"
## [5] "KEGG_WNT_SIGNALING_PATHWAY"
##
## $ENSMUSG00000000149
## [1] "KEGG_LONG_TERM_DEPRESSION"
## [2] "KEGG_MAPK_SIGNALING_PATHWAY"
## [3] "KEGG_REGULATION_OF_ACTIN_CYTOSKELETON"
## [4] "KEGG_VASCULAR_SMOOTH_MUSCLE_CONTRACTION"
##
## $ENSMUSG00000000168
## [1] "KEGG_CITRATE_CYCLE_TCA_CYCLE" "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"
## [3] "KEGG_PYRUVATE_METABOLISM"
goseqRes <- goseq(pwfLen, gene2cat = kgByGene) %>%
as_tibble() %>%
mutate(
Expected = round(sum(topTable$DE)*numInCat / length(kgByGene), 1),
FDR = p.adjust(over_represented_pvalue, "fdr")
) %>%
dplyr::select(
category,
Expected,
nDE = numDEInCat,
nGenes = numInCat,
p = over_represented_pvalue,
FDR
)
goseqRes %>%
dplyr::filter(FDR < 0.01) %>%
left_join(kg, by = c("category" = "gs_name")) %>%
dplyr::select(
any_of(colnames(goseqRes)), gene_symbol, gene_id
) %>%
dplyr::filter(gene_id %in% names(which(deVec))) %>%
dplyr::select(-gene_id) %>%
chop(gene_symbol) %>%
mutate(
gene_symbol = vapply(gene_symbol, paste, character(1), collapse = "; "),
category = str_remove_all(category, "KEGG_"),
category = str_replace_all(category, "_", " "),
category = str_to_title(category),
p = sprintf("%.2e", p),
FDR = sprintf("%.2e", FDR)
) %>%
dplyr::rename(
Category = category,
Genes = gene_symbol
) %>%
pander(
justify = "lrrrrrl",
split.tables = Inf,
split.cells = 30,
caption = "Enriched KEGG pathways to an FDR of 0.01, with DE genes from each pathway shown in the final column"
)
Category | Expected | nDE | nGenes | p | FDR | Genes |
---|---|---|---|---|---|---|
Cytokine Cytokine Receptor Interaction | 19.1 | 24 | 85 | 1.67e-09 | 3.09e-07 | Acvr2a; Ccl5; Ccr2; Ccr3; Ccr4; Ccr5; Ccr7; Ccr9; Csf2ra; Cxcr4; Fasl; Ifngr1; Il10; Il17rb; Il18rap; Il2rb; Il3ra; Il6ra; Il9r; Tgfb3; Tnfrsf10b; Tnfrsf10b; Tnfrsf10b; Tnfrsf10b; Tnfsf11; Tnfsf14; Tnfsf8 |
Hematopoietic Cell Lineage | 7.9 | 15 | 35 | 3.94e-09 | 3.64e-07 | Cd44; Cd7; Cd8a; Cd8b1; Cd9; Csf2ra; Il3ra; Il6ra; Il9r; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3 |
Arrhythmogenic Right Ventricular Cardiomyopathy Arvc | 7.2 | 14 | 32 | 1.76e-08 | 1.08e-06 | Actn1; Cacnb3; Cacnb4; Ctnna1; Dsg2; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lmna; Tcf7 |
Ecm Receptor Interaction | 6.7 | 13 | 30 | 9.33e-08 | 4.31e-06 | Cd44; Col1a1; Col1a2; Col6a2; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lama5; Sdc4 |
Neuroactive Ligand Receptor Interaction | 9 | 13 | 40 | 3.42e-06 | 1.27e-04 | Chrnb1; Chrnb2; Cysltr1; Cysltr2; Glp1r; Gpr83; Ltb4r1; Nr3c1; P2ry14; Ptafr; Ptgir; Tbxa2r; Vipr1 |
Dilated Cardiomyopathy | 8.7 | 12 | 39 | 1.65e-05 | 5.08e-04 | Adcy7; Cacnb3; Cacnb4; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lmna; Tgfb3 |
Hypertrophic Cardiomyopathy Hcm | 8.3 | 11 | 37 | 4.46e-05 | 1.18e-03 | Cacnb3; Cacnb4; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lmna; Tgfb3 |
highest | median |
---|---|
9.6 | 5.6 |