Instructions

Submission Format [3 marks]

This is assignment is due by 5pm, Tuesday 26th May.

  • Submissions must be made as a zip archive containing 2 files:
    1. Your source R Markdown Document (with an Rmd suffix)
    2. A compiled pdf, showing all code
  • All file names within the zip archive must start with your student number. However the name of the zip archive is not important as myUni will likely modify this during submission. See here for help creating a zip archive

All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.

Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.

We strongly advise working in the folder ~/transcriptomics/assignment4 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

  1. Select both files using the Files pane in R Studio
  2. Click export
  3. They will automatically be placed into a single zip archive. Please name this in whatever informative name you decide is suitable, but it should contain the suffix .zip

Windows

If all files are on your on your local Windows machine:

  1. Using File Explorer, enter the folder containing both files
  2. Select all files simultaneously by using Ctrl + Click
  3. Right click on one of the files and select Send to > Compressed (zipped) folder
  4. Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your *local macOS machine`:

  1. Locate the items to zip in the Mac Finder (file system)
  2. Right-click on a file, folder, or files you want to zip
  3. Select “Compress Items”
  4. Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [4 marks]

For microarray data we are able to assume normality, whilst for RNA-seq we are not. Describe briefly why this is and what types of distributions and models we are able to use instead.

Question 2 [4 marks]

Normalisation in RNA-seq can be performed using offsets in the model fitting stages, with two of the most common approaches being TMM and CQN. Briefly describe which technical issues the two methods are attempting to overcome. Whilst there is much mathematical detail in these papers which is beyond the scope of this course, the papers describing the two approaches are available here for TMM and here for CQN. Feel free to use these papers whilst formulating your answer.

Question 3 [8 marks]

source("https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/A4Funs.R")
makeGsTestData("a1234567")

For this question, please run the above code to generate your test dataset, using your own student ID instead of the one provided. This will place an object called gsTestData in your R Environment. Once you have this object, please perform the following tasks

  1. Manipulate the data to be a \(2\times2\) matrix
  2. Print the matrix in your RMarkdown using pander(). Missing column names in the first printed column are expected
  3. Conduct Fisher’s Exact Test on this matrix using fisher.test() and provide the output in your RMarkdown
  4. Interpret the results including a description of \(H_0\) and your final conclusion regarding an enrichment of the gene-set in your DE genes.

Question 4 [15 marks]

For this question you will need the dataset provided for you here, which is a comparison of splenic and pancreatic Tregs from Mus musculus.

  1. Load this data into R as an object called topTable.
  2. Decide on which genes to consider as differentially expressed (DE). You may choose any suitable criteria such as an FDR, or an FDR in combination with filtering based on logFC or simply choose the top n-ranked genes by p-value. Explain your reasoning.
  3. Assess your DE genes for GC and Length bias as outlined in the goseq package. If any discernible pattern is present choose one of these as the bias you are choosing to offset and explain why.
  4. Conduct an enrichment analysis using goseq() choosing one of the collections available in the msigdbr package. Justify any filtering of gene-sets that you undertake, or why you have chosen not to. A list of possible collections is provided on the help page for msigdbr. Please note that is using the categories C2, C3 or C5 you will be expected to choose only one of the subcategories. A full list with complete descriptions is available here
  5. Present your results as a table of the most highly-ranked genesets, and if possible provide the top n-ranked genes within each geneset as a separate column. Summarise your results very briefly in plain text.

If working with others in a group, please ensure that you choose different gene-set collections for your submission.

Total: 34 marks