This assignment is due by 5pm, Tuesday 12th May.
All questions are to be answered in the same R Markdown / PDF, regardless of whether they require a plain text answer or the execution of code.
Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this in mind.
We strongly advise working in the folder ~/transcriptomics/assignment2 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.
If all files required for submission are contained on your VM:
.zip
If all files are on your local Windows machine:
Send to > Compressed (zipped) folder
.zip
If all the files are on your local macOS machine:
Two common strategies for RNA Seq library preparation are the depletion of rRNA molecules or the preferential amplification of poly-adenylated RNA. Briefly contrast these two approaches, describing their respective strengths and limitations.
A common alignment and quantification workflow is to align an RNA Seq sample to a reference genome and count the reads which align to each gene, as defined in a gtf file.
A transcriptomic experiment was designed to test differences in gene expression based on loss-of-function mutations in a specific gene (myGene). Three genotypes (Wild-Type, Heterozygous and Homozygous) were analysed, and these may also be described as myGene+/+, myGene+/- and myGene-/-. For an experiment with \(n = 4\) samples from each genotype, describe two possible approaches, detailing the advantages of each over the other. Include code for generating a model matrix and a contrast matrix based on the following layout.
You should start the coding section by copying the following code to generate the appropriate metadata object.
library(tibble) # provides tibble()
genoData <- tibble(
  sampleID = paste0("S", 1:12),
  replicate = rep(1:4, 3),
  myGene = rep(c("+/+", "+/-", "-/-"), each = 4),
  genotype = rep(c("WT", "Het", "Hom"), each = 4)
)
(Hint: Consider how to set a categorical variable in R)
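As a general illustration of the hint, the sketch below shows how R orders the levels of a categorical variable. It uses a hypothetical `dose` variable that is not part of the assignment data, and the names `dose`, `f_default` and `f_ordered` are assumptions made purely for this example. By default `factor()` sorts levels alphabetically, and the first level is treated as the reference when a model matrix is built.

```r
# Hypothetical example, unrelated to the assignment data
dose <- rep(c("control", "low", "high"), each = 2)

# Default behaviour: levels are sorted alphabetically
f_default <- factor(dose)
levels(f_default)
# "control" "high" "low"

# Explicit levels give a meaningful order; the first level
# becomes the reference category in model.matrix()
f_ordered <- factor(dose, levels = c("control", "low", "high"))
levels(f_ordered)
# "control" "low" "high"

# Column names of the resulting model matrix reflect the level order
colnames(model.matrix(~ f_ordered))
# "(Intercept)" "f_orderedlow" "f_orderedhigh"
```

Whether the default ordering is suitable depends on which group you want as the baseline for comparisons.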
For this question, all data was obtained from the public dataset located here, but has been partially prepared and filtered for you. Data was generated using Illumina microarrays and as such the assumption of normality is appropriate. Using the metadata, gene annotation and expression values contained in each of these three files:
ExpressionSet
, followed by a suitable design matrix
Include captions explaining each figure or table, and ensure axes are correctly labelled where appropriate.
For the most highly-ranked upregulated gene (based on the p-value), in which sample group is it most highly expressed?