Instructions

Submission Format [3 marks]

This is assignment is due by 5pm, Tuesday 12^th May.

Submissions must be made as a zip archive containing 2 files:
1. Your source R Markdown Document (with an Rmd suffix)
2. A compiled pdf, showing all code
All file names within the zip archive must start with your student number. However the name of the zip archive is not important as myUni will likely modify this during submission. See here for help creating a zip archive

All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.

Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.

We strongly advise working in the folder ~/transcriptomics/assignment2 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

Select both files using the Files pane in R Studio
Click export
They will automatically be placed into a single zip archive. Please name this in whatever informative name you decide is suitable, but it should contain the suffix .zip

Windows

If all files are on your on your local Windows machine:

Using File Explorer, enter the folder containing both files
Select all files simultaneously by using Ctrl + Click
Right click on one of the files and select Send to > Compressed (zipped) folder
Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your *local macOS machine`:

Locate the items to zip in the Mac Finder (file system)
Right-click on a file, folder, or files you want to zip
Select “Compress Items”
Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [8 marks]

Two common strategies for RNA Seq library preparation are the depletion of rRNA molecules or the preferential amplification of poly-adenylated RNA. Briefly contrast these two approaches, describing their respective strengths and limitations.

Any of the following points up to the total of 8 marks will be accepted, along with any other suitable comparisons

totalRNA (rRNA-depletion) actively removes RNA molecules by one of a number of strategies, using sequence identity to remove molecules such as rRNA which are not of experimental interest
polyA techniques use an amplification strategy in which polyA/T sequences are targeted for amplification
polyA techniques will preferentially amplify mature mRNAs with a polyA tail, ensuring mature mRNAs form the majority of the sequenced molecules. rRNA-reduction strategies will include immature (i.e. unprocessed) transcripts in their pool of amplified molecules
polyA amplification will include any molecules with a polyA tail, but will miss molecules such as those ncRNA without a polyA tail. These are captured by rRNA-reduction strategies however
rRNA-reduction can be highly variable between samples within the same experiment adding noise

Question 2 [10 marks]

A common alignment and quantification workflow is to align an RNA Seq sample to a reference genome and count reads which align to each gene, as defined in a gtf file.

Describe two important considerations during the quantification step, and how these may differ for polyA and total RNA libraries?
What advantages and disadvantages does this approach offer in comparison to using pseudo-alignment to a reference transcriptome

Do we count reads which align strictly to exonic regions, or do we include reads that partially overlap introns? For total-RNA we may have unspliced transcripts, whereas we probably won’t for polyA libraries
Reads can often align to multiple locations, so how do we manage multi-mapped reads. This should be the same for either library preparation method

Using a gene-centric model enables easier visualisation and detection of unusual splice variation.
Novel transcripts are more easily detected as we’re not limited to known transcripts or even to known exons
Using a transcriptome as our reference is faster, and gives transcript-level expression estimates.

Question 3 [8 marks]

A transcriptomic experiment was designed to test differences in gene expresson based on loss-of-function mutations in a specific gene (myGene). Three genotypes: Wild-Type, Heterozygous and Homozygous were analysed, and these may also be described as myGene^+/+, myGene^+/- and myGene^-/-. For an experiment with \(n = 4\) samples from each genotype, describe two possible approaches detailing advantages of each over the other. Include code for generating a model matrix and a contrast matrix based on the following layout.

You should start the coding section by copying the following code to generate the appropriate metadata object.

genoData <- tibble(
  sampleID = paste0("S", 1:12),
  replicate = rep(1:4, 3),
  myGene = rep(c("+/+", "+/-", "-/-"), each = 4),
  genotype = rep(c("WT", "Het", "Hom"), each = 4)
)

(Hint: Consider how to set a categorical variable in R)

We could take a few approaches here, with two of these being setting a reference genotype, such as the wild-type samples, and assessing the impact of the mutant genotype in comparison to this. One possible parameterisation would be as follows.

library(limma)
genoData <- genoData %>%
  mutate(genotype = factor(genotype, levels = c("WT", "Het", "Hom")))
X_wt_as_ref <- model.matrix(~genotype, data = genoData)
X_wt_as_ref

##    (Intercept) genotypeHet genotypeHom
## 1            1           0           0
## 2            1           0           0
## 3            1           0           0
## 4            1           0           0
## 5            1           1           0
## 6            1           1           0
## 7            1           1           0
## 8            1           1           0
## 9            1           0           1
## 10           1           0           1
## 11           1           0           1
## 12           1           0           1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$genotype
## [1] "contr.treatment"

Under this parameterisation, the columns genotypeHet and genotypeHom capture the differences between each mutant genotype and the WT samples (i.e. the intercept). To compare between genotypes, we could define a contrast matrix

makeContrasts(
  HetVsWT = genotypeHet,
  HomVsWT = genotypeHom,
  HomVsHet = genotypeHom - genotypeHet,
  levels = colnames(X_wt_as_ref)
)

##              Contrasts
## Levels        HetVsWT HomVsWT HomVsHet
##   Intercept         0       0        0
##   genotypeHet       1       0       -1
##   genotypeHom       0       1        1

An approach in which no reference group was defined could be as follows

X_no_ref <- model.matrix(~0 + genotype, data = genoData)
X_no_ref

##    genotypeWT genotypeHet genotypeHom
## 1           1           0           0
## 2           1           0           0
## 3           1           0           0
## 4           1           0           0
## 5           0           1           0
## 6           0           1           0
## 7           0           1           0
## 8           0           1           0
## 9           0           0           1
## 10          0           0           1
## 11          0           0           1
## 12          0           0           1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$genotype
## [1] "contr.treatment"

Under this strategy, we need to define our contrassts again

makeContrasts(
  HetVsWT = genotypeHet - genotypeWT,
  HomVsWT = genotypeHom - genotypeWT,
  HomVsHet = genotypeHom - genotypeHet,
  levels = colnames(X_no_ref)
)

##              Contrasts
## Levels        HetVsWT HomVsWT HomVsHet
##   genotypeWT       -1      -1        0
##   genotypeHet       1       0       -1
##   genotypeHom       0       1        1

A final option may be to define WT as the reference level, then include a column for the presence of a mutant allele, followed by a third indicating the effects of be homozygous mutant.

genoData <- genoData %>%
  mutate(
    mutant = str_detect(myGene, "-"),
    homMut = str_detect(genotype, "Hom")
  ) 
X_mut_geno <- model.matrix(~mutant + homMut, data = genoData)
X_mut_geno

##    (Intercept) mutantTRUE homMutTRUE
## 1            1          0          0
## 2            1          0          0
## 3            1          0          0
## 4            1          0          0
## 5            1          1          0
## 6            1          1          0
## 7            1          1          0
## 8            1          1          0
## 9            1          1          1
## 10           1          1          1
## 11           1          1          1
## 12           1          1          1
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$mutant
## [1] "contr.treatment"
## 
## attr(,"contrasts")$homMut
## [1] "contr.treatment"

This parameterisation works very well in either a dominant or recessive context. In a dominant model most genes will be detected as DE by the column mutantTRUE, whilst in a recessive model, most genes would be detected as DE by the column homMutTRUE. This effectively captures the common effects of the mutation between genotypes, and where they differ. No contrast matrix is really necessary as this captures most of the information we need.

Question 4 [18 marks]

For this question all data was obtained from the public dataset located here, but has bene partially prepared and filtered for you. Data was generated using Illumina Microarrays and as such the assumption of normality is appropriate. Using the metadata, gene annotation and expression values contained in each of these three files:

Form an ExpressionSet, followed by a suitable design matrix
Perform a differential expression analysis based on your design matrix
Generate MA and Volcano plots, labelling any genes you think are appropriate
Produce a table of the top 10 most highly ranked genes from this analysis

Include captions explaining each figure or table and ensure correctly labelled axes where appropriate.

For the most highly-ranked upregulated gene (based on the p-value), which sample group is it the most highly-expressed in?

library(tidyverse)
library(Biobase)
library(magrittr)
library(ggrepel)
library(pander)
exprs <- read_tsv("data/GSE71868_exprs.tsv") %>%
  as.data.frame() %>%
  column_to_rownames("ILMN_ID") %>%
  as.matrix()
samples <- read_tsv("data/GSE71868_meta.txt") %>%
  mutate(
    age = factor(age, levels = c("6-8 weeks", "10-13 months")),
    group = str_extract(sample, "(Middle-aged|Young)"),
    group = factor(group, levels = c("Young", "Middle-aged"))
  ) %>%
  as.data.frame() %>%
  set_rownames(.$geo_accession) %>%
  .[colnames(exprs),]
genes <- read_tsv("data/GSE71868_genes.tsv") %>%
  as.data.frame() %>%
  set_rownames(.$ID) %>%
  .[rownames(exprs),]
eset <- ExpressionSet(
  assayData = exprs,
  phenoData = AnnotatedDataFrame(samples),
  featureData = AnnotatedDataFrame(genes)
)
eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 14105 features, 8 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM1847063 GSM1847064 ... GSM1847070 (8 total)
##   varLabels: sample geo_accession ... group (7 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1212607 ILMN_1212619 ... ILMN_3163582 (14105
##     total)
##   fvarLabels: ID Source ... Symbol (5 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

# NB: you could also have used the age column
X <- model.matrix(~group, data = pData(eset))
X

##            (Intercept) groupMiddle-aged
## GSM1847063           1                1
## GSM1847064           1                1
## GSM1847065           1                1
## GSM1847066           1                1
## GSM1847067           1                0
## GSM1847068           1                0
## GSM1847069           1                0
## GSM1847070           1                0
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"

fit <- lmFit(eset, X) %>%
  eBayes()
results <- topTable(fit, coef = "groupMiddle-aged", number = Inf) %>%
  as_tibble() %>%
  mutate(DE = adj.P.Val < 0.05)
head(results)

## # A tibble: 6 x 12
##   ID    Source Source_Referenc… Entrez_Gene_ID Symbol logFC AveExpr     t
##   <chr> <chr>  <chr>                     <dbl> <chr>  <dbl>   <dbl> <dbl>
## 1 ILMN… RefSeq NM_181588.2               69574 23100…  2.98    6.68  29.4
## 2 ILMN… MEEBO  scl0003397.1_5               NA Myo5a   1.58    6.59  23.2
## 3 ILMN… RefSeq NM_032003.1               83965 Enpp5   4.48    5.93  22.1
## 4 ILMN… RefSeq NM_010176.1               14085 Fah     1.55    6.61  18.2
## 5 ILMN… RefSeq NM_026334.1               67717 Lipf    1.54    9.55  17.7
## 6 ILMN… RefSeq NM_021877.2               15460 Hr      3.15    9.36  17.5
## # … with 4 more variables: P.Value <dbl>, adj.P.Val <dbl>, B <dbl>, DE <lgl>

results %>%
  ggplot(aes(AveExpr, logFC)) +
  geom_point(aes(colour = DE), alpha = 0.5) +
  geom_smooth(se = FALSE) +
  geom_hline(yintercept = c(-1, 1), colour = "blue", linetype = 2) +
  geom_text_repel(
    aes(colour = DE, label = Symbol),
    data = . %>% dplyr::filter(logFC > 4),
    show.legend = FALSE
  ) +
  geom_text_repel(
    aes(colour = DE, label = Symbol),
    data = . %>% dplyr::filter(logFC < -2.5 & DE),
    show.legend = FALSE
  ) +
  scale_color_manual(values = c("grey50", "red")) +
  theme_bw()

results %>%
  ggplot(aes(logFC, -log10(P.Value))) +
  geom_point(aes(colour = DE), alpha = 0.5) +
  geom_vline(xintercept = c(-1, 1), colour = "blue", linetype = 2) +
  geom_text_repel(
    aes(colour = DE, label = Symbol),
    data = . %>% dplyr::filter(P.Value < 1e-6),
    show.legend = FALSE
  ) +
  scale_color_manual(values = c("grey50", "red")) +
  theme_bw()

results %>%
  arrange(P.Value) %>%
  dplyr::slice(1:10) %>%
  dplyr::select(
    ID, EntrezGene = Entrez_Gene_ID, Symbol, logFC, AveExpr, t, P.Value, FDR = adj.P.Val
  ) %>%
  pander(
    justify = "lllrrrrr",
    caption = "The 10 most highly-ranked DE genes when comparing the two ages",
    split.table = Inf
  )

The 10 most highly-ranked DE genes when comparing the two ages
ID	EntrezGene	Symbol	logFC	AveExpr	t	P.Value	FDR
ILMN_2713673	69574	2310016A09Rik	2.977	6.685	29.37	5.531e-09	7.802e-05
ILMN_1254736	NA	Myo5a	1.585	6.589	23.18	3.181e-08	0.0002119
ILMN_1217118	83965	Enpp5	4.485	5.93	22.11	4.506e-08	0.0002119
ILMN_2614494	14085	Fah	1.552	6.606	18.2	1.876e-07	0.0005978
ILMN_2863532	67717	Lipf	1.537	9.55	17.65	2.348e-07	0.0005978
ILMN_2800687	15460	Hr	3.15	9.362	17.46	2.543e-07	0.0005978
ILMN_2635462	NA	Cxcl15	-1.4	6.991	-17.03	3.049e-07	0.0006143
ILMN_2718266	14229	Fkbp5	-1.724	8.833	-16.17	4.453e-07	0.0006855
ILMN_2947526	13601	Ecm1	2.596	8.361	16.09	4.625e-07	0.0006855
ILMN_2740149	212862	Chpt1	1.214	6.533	15.87	5.11e-07	0.0006855

Given our design matrix, an upregulated gene will be most highly expressed in the ‘Middle-aged’ samples, and this is the case for our most highly ranked gene, 2310016A09Rik

Total: 47 marks

Summary of grades for assessment 3 2020, scaled to be out of a possible 10 marks
highest	median
9.4	8.3

Transcriptomics Applications

Assignment 3 Solutions