This is assignment is due by 5pm, Tuesday 12th May.
All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.
Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.
We strongly advise working in the folder ~/transcriptomics/assignment2
on your virtual machine. Using an R Project for each individual assignment is also strongly advised.
If all files required for submission are contained on your VM:
.zip
If all files are on your on your local Windows machine:
Send to > Compressed (zipped) folder
.zip
If all the files are on your *local macOS machine`:
Two common strategies for RNA Seq library preparation are the depletion of rRNA molecules or the preferential amplification of poly-adenylated RNA. Briefly contrast these two approaches, describing their respective strengths and limitations.
Any of the following points up to the total of 8 marks will be accepted, along with any other suitable comparisons
A common alignment and quantification workflow is to align an RNA Seq sample to a reference genome and count reads which align to each gene, as defined in a gtf
file.
A transcriptomic experiment was designed to test differences in gene expresson based on loss-of-function mutations in a specific gene (myGene). Three genotypes: Wild-Type, Heterozygous and Homozygous were analysed, and these may also be described as myGene+/+, myGene+/- and myGene-/-. For an experiment with \(n = 4\) samples from each genotype, describe two possible approaches detailing advantages of each over the other. Include code for generating a model matrix and a contrast matrix based on the following layout.
You should start the coding section by copying the following code to generate the appropriate metadata object.
genoData <- tibble(
sampleID = paste0("S", 1:12),
replicate = rep(1:4, 3),
myGene = rep(c("+/+", "+/-", "-/-"), each = 4),
genotype = rep(c("WT", "Het", "Hom"), each = 4)
)
(Hint: Consider how to set a categorical variable in R)
We could take a few approaches here, with two of these being setting a reference genotype, such as the wild-type samples, and assessing the impact of the mutant genotype in comparison to this. One possible parameterisation would be as follows.
library(limma)
genoData <- genoData %>%
mutate(genotype = factor(genotype, levels = c("WT", "Het", "Hom")))
X_wt_as_ref <- model.matrix(~genotype, data = genoData)
X_wt_as_ref
## (Intercept) genotypeHet genotypeHom
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 1 0
## 6 1 1 0
## 7 1 1 0
## 8 1 1 0
## 9 1 0 1
## 10 1 0 1
## 11 1 0 1
## 12 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$genotype
## [1] "contr.treatment"
Under this parameterisation, the columns genotypeHet
and genotypeHom
capture the differences between each mutant genotype and the WT samples (i.e. the intercept). To compare between genotypes, we could define a contrast matrix
makeContrasts(
HetVsWT = genotypeHet,
HomVsWT = genotypeHom,
HomVsHet = genotypeHom - genotypeHet,
levels = colnames(X_wt_as_ref)
)
## Contrasts
## Levels HetVsWT HomVsWT HomVsHet
## Intercept 0 0 0
## genotypeHet 1 0 -1
## genotypeHom 0 1 1
An approach in which no reference group was defined could be as follows
X_no_ref <- model.matrix(~0 + genotype, data = genoData)
X_no_ref
## genotypeWT genotypeHet genotypeHom
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 0 1 0
## 6 0 1 0
## 7 0 1 0
## 8 0 1 0
## 9 0 0 1
## 10 0 0 1
## 11 0 0 1
## 12 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$genotype
## [1] "contr.treatment"
Under this strategy, we need to define our contrassts again
makeContrasts(
HetVsWT = genotypeHet - genotypeWT,
HomVsWT = genotypeHom - genotypeWT,
HomVsHet = genotypeHom - genotypeHet,
levels = colnames(X_no_ref)
)
## Contrasts
## Levels HetVsWT HomVsWT HomVsHet
## genotypeWT -1 -1 0
## genotypeHet 1 0 -1
## genotypeHom 0 1 1
A final option may be to define WT as the reference level, then include a column for the presence of a mutant allele, followed by a third indicating the effects of be homozygous mutant.
genoData <- genoData %>%
mutate(
mutant = str_detect(myGene, "-"),
homMut = str_detect(genotype, "Hom")
)
X_mut_geno <- model.matrix(~mutant + homMut, data = genoData)
X_mut_geno
## (Intercept) mutantTRUE homMutTRUE
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 1 0
## 6 1 1 0
## 7 1 1 0
## 8 1 1 0
## 9 1 1 1
## 10 1 1 1
## 11 1 1 1
## 12 1 1 1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$mutant
## [1] "contr.treatment"
##
## attr(,"contrasts")$homMut
## [1] "contr.treatment"
This parameterisation works very well in either a dominant or recessive context. In a dominant model most genes will be detected as DE by the column mutantTRUE
, whilst in a recessive model, most genes would be detected as DE by the column homMutTRUE
. This effectively captures the common effects of the mutation between genotypes, and where they differ. No contrast matrix is really necessary as this captures most of the information we need.
For this question all data was obtained from the public dataset located here, but has bene partially prepared and filtered for you. Data was generated using Illumina Microarrays and as such the assumption of normality is appropriate. Using the metadata, gene annotation and expression values contained in each of these three files:
ExpressionSet
, followed by a suitable design matrixInclude captions explaining each figure or table and ensure correctly labelled axes where appropriate.
For the most highly-ranked upregulated gene (based on the p-value), which sample group is it the most highly-expressed in?
library(tidyverse)
library(Biobase)
library(magrittr)
library(ggrepel)
library(pander)
exprs <- read_tsv("data/GSE71868_exprs.tsv") %>%
as.data.frame() %>%
column_to_rownames("ILMN_ID") %>%
as.matrix()
samples <- read_tsv("data/GSE71868_meta.txt") %>%
mutate(
age = factor(age, levels = c("6-8 weeks", "10-13 months")),
group = str_extract(sample, "(Middle-aged|Young)"),
group = factor(group, levels = c("Young", "Middle-aged"))
) %>%
as.data.frame() %>%
set_rownames(.$geo_accession) %>%
.[colnames(exprs),]
genes <- read_tsv("data/GSE71868_genes.tsv") %>%
as.data.frame() %>%
set_rownames(.$ID) %>%
.[rownames(exprs),]
eset <- ExpressionSet(
assayData = exprs,
phenoData = AnnotatedDataFrame(samples),
featureData = AnnotatedDataFrame(genes)
)
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 14105 features, 8 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM1847063 GSM1847064 ... GSM1847070 (8 total)
## varLabels: sample geo_accession ... group (7 total)
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1212607 ILMN_1212619 ... ILMN_3163582 (14105
## total)
## fvarLabels: ID Source ... Symbol (5 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
# NB: you could also have used the age column
X <- model.matrix(~group, data = pData(eset))
X
## (Intercept) groupMiddle-aged
## GSM1847063 1 1
## GSM1847064 1 1
## GSM1847065 1 1
## GSM1847066 1 1
## GSM1847067 1 0
## GSM1847068 1 0
## GSM1847069 1 0
## GSM1847070 1 0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
fit <- lmFit(eset, X) %>%
eBayes()
results <- topTable(fit, coef = "groupMiddle-aged", number = Inf) %>%
as_tibble() %>%
mutate(DE = adj.P.Val < 0.05)
head(results)
## # A tibble: 6 x 12
## ID Source Source_Referenc… Entrez_Gene_ID Symbol logFC AveExpr t
## <chr> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 ILMN… RefSeq NM_181588.2 69574 23100… 2.98 6.68 29.4
## 2 ILMN… MEEBO scl0003397.1_5 NA Myo5a 1.58 6.59 23.2
## 3 ILMN… RefSeq NM_032003.1 83965 Enpp5 4.48 5.93 22.1
## 4 ILMN… RefSeq NM_010176.1 14085 Fah 1.55 6.61 18.2
## 5 ILMN… RefSeq NM_026334.1 67717 Lipf 1.54 9.55 17.7
## 6 ILMN… RefSeq NM_021877.2 15460 Hr 3.15 9.36 17.5
## # … with 4 more variables: P.Value <dbl>, adj.P.Val <dbl>, B <dbl>, DE <lgl>
results %>%
ggplot(aes(AveExpr, logFC)) +
geom_point(aes(colour = DE), alpha = 0.5) +
geom_smooth(se = FALSE) +
geom_hline(yintercept = c(-1, 1), colour = "blue", linetype = 2) +
geom_text_repel(
aes(colour = DE, label = Symbol),
data = . %>% dplyr::filter(logFC > 4),
show.legend = FALSE
) +
geom_text_repel(
aes(colour = DE, label = Symbol),
data = . %>% dplyr::filter(logFC < -2.5 & DE),
show.legend = FALSE
) +
scale_color_manual(values = c("grey50", "red")) +
theme_bw()
results %>%
ggplot(aes(logFC, -log10(P.Value))) +
geom_point(aes(colour = DE), alpha = 0.5) +
geom_vline(xintercept = c(-1, 1), colour = "blue", linetype = 2) +
geom_text_repel(
aes(colour = DE, label = Symbol),
data = . %>% dplyr::filter(P.Value < 1e-6),
show.legend = FALSE
) +
scale_color_manual(values = c("grey50", "red")) +
theme_bw()
results %>%
arrange(P.Value) %>%
dplyr::slice(1:10) %>%
dplyr::select(
ID, EntrezGene = Entrez_Gene_ID, Symbol, logFC, AveExpr, t, P.Value, FDR = adj.P.Val
) %>%
pander(
justify = "lllrrrrr",
caption = "The 10 most highly-ranked DE genes when comparing the two ages",
split.table = Inf
)
ID | EntrezGene | Symbol | logFC | AveExpr | t | P.Value | FDR |
---|---|---|---|---|---|---|---|
ILMN_2713673 | 69574 | 2310016A09Rik | 2.977 | 6.685 | 29.37 | 5.531e-09 | 7.802e-05 |
ILMN_1254736 | NA | Myo5a | 1.585 | 6.589 | 23.18 | 3.181e-08 | 0.0002119 |
ILMN_1217118 | 83965 | Enpp5 | 4.485 | 5.93 | 22.11 | 4.506e-08 | 0.0002119 |
ILMN_2614494 | 14085 | Fah | 1.552 | 6.606 | 18.2 | 1.876e-07 | 0.0005978 |
ILMN_2863532 | 67717 | Lipf | 1.537 | 9.55 | 17.65 | 2.348e-07 | 0.0005978 |
ILMN_2800687 | 15460 | Hr | 3.15 | 9.362 | 17.46 | 2.543e-07 | 0.0005978 |
ILMN_2635462 | NA | Cxcl15 | -1.4 | 6.991 | -17.03 | 3.049e-07 | 0.0006143 |
ILMN_2718266 | 14229 | Fkbp5 | -1.724 | 8.833 | -16.17 | 4.453e-07 | 0.0006855 |
ILMN_2947526 | 13601 | Ecm1 | 2.596 | 8.361 | 16.09 | 4.625e-07 | 0.0006855 |
ILMN_2740149 | 212862 | Chpt1 | 1.214 | 6.533 | 15.87 | 5.11e-07 | 0.0006855 |
Given our design matrix, an upregulated gene will be most highly expressed in the ‘Middle-aged’ samples, and this is the case for our most highly ranked gene, 2310016A09Rik
highest | median |
---|---|
9.4 | 8.3 |