Instructions

Submission Format [3 marks]

This is assignment is due by 5pm, Tuesday 12^th May.

Submissions must be made as a zip archive containing 2 files:
1. Your source R Markdown Document (with an Rmd suffix)
2. A compiled pdf, showing all code
All file names within the zip archive must start with your student number. However the name of the zip archive is not important as myUni will likely modify this during submission. See here for help creating a zip archive

All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.

Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.

We strongly advise working in the folder ~/transcriptomics/assignment2 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

Select both files using the Files pane in R Studio
Click export
They will automatically be placed into a single zip archive. Please name this in whatever informative name you decide is suitable, but it should contain the suffix .zip

Windows

If all files are on your on your local Windows machine:

Using File Explorer, enter the folder containing both files
Select all files simultaneously by using Ctrl + Click
Right click on one of the files and select Send to > Compressed (zipped) folder
Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your *local macOS machine`:

Locate the items to zip in the Mac Finder (file system)
Right-click on a file, folder, or files you want to zip
Select “Compress Items”
Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [8 marks]

Two common strategies for RNA Seq library preparation are the depletion of rRNA molecules or the preferential amplification of poly-adenylated RNA. Briefly contrast these two approaches, describing their respective strengths and limitations.

Question 2 [10 marks]

A common alignment and quantification workflow is to align an RNA Seq sample to a reference genome and count reads which align to each gene, as defined in a gtf file.

Describe two important considerations during the quantification step, and how these may differ for polyA and total RNA libraries?
What advantages and disadvantages does this approach offer in comparison to using pseudo-alignment to a reference transcriptome

Question 3 [8 marks]

A transcriptomic experiment was designed to test differences in gene expresson based on loss-of-function mutations in a specific gene (myGene). Three genotypes: Wild-Type, Heterozygous and Homozygous were analysed, and these may also be described as myGene^+/+, myGene^+/- and myGene^-/-. For an experiment with \(n = 4\) samples from each genotype, describe two possible approaches detailing advantages of each over the other. Include code for generating a model matrix and a contrast matrix based on the following layout.

You should start the coding section by copying the following code to generate the appropriate metadata object.

genoData <- tibble(
  sampleID = paste0("S", 1:12),
  replicate = rep(1:4, 3),
  myGene = rep(c("+/+", "+/-", "-/-"), each = 4),
  genotype = rep(c("WT", "Het", "Hom"), each = 4)
)

(Hint: Consider how to set a categorical variable in R)

Question 4 [18 marks]

For this question all data was obtained from the public dataset located here, but has bene partially prepared and filtered for you. Data was generated using Illumina Microarrays and as such the assumption of normality is appropriate. Using the metadata, gene annotation and expression values contained in each of these three files:

Form an ExpressionSet, followed by a suitable design matrix
Perform a differential expression analysis based on your design matrix
Generate MA and Volcano plots, labelling any genes you think are appropriate
Produce a table of the top 10 most highly ranked genes from this analysis

Include captions explaining each figure or table and ensure correctly labelled axes where appropriate.

For the most highly-ranked upregulated gene (based on the p-value), which sample group is it the most highly-expressed in?

Transcriptomics Applications

Assignment 2