Instructions

Submission Format [3 marks]

This is assignment is due by 5pm, Tuesday 26^th May.

Submissions must be made as a zip archive containing 2 files:
1. Your source R Markdown Document (with an Rmd suffix)
2. A compiled pdf, showing all code
All file names within the zip archive must start with your student number. However the name of the zip archive is not important as myUni will likely modify this during submission. See here for help creating a zip archive

All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.

Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.

We strongly advise working in the folder ~/transcriptomics/assignment4 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

Select both files using the Files pane in R Studio
Click export
They will automatically be placed into a single zip archive. Please name this in whatever informative name you decide is suitable, but it should contain the suffix .zip

Windows

If all files are on your on your local Windows machine:

Using File Explorer, enter the folder containing both files
Select all files simultaneously by using Ctrl + Click
Right click on one of the files and select Send to > Compressed (zipped) folder
Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your *local macOS machine`:

Locate the items to zip in the Mac Finder (file system)
Right-click on a file, folder, or files you want to zip
Select “Compress Items”
Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [4 marks]

For microarray data we are able to assume normality, whilst for RNA-seq we are not. Describe briefly why this is and what types of distributions and models we are able to use instead.

Microarray data is a continuous measurement whilst RNA-Seq is measured in counts
Counts can be modelled using either a Poisson or a Negative Binomial model to determine the rate of occurence in a fixed unit of measurement
For microarray data, the mean (i.e. expression level) and variance are able to be assumed as independent
For RNA-Seq the mean and variance are not independent under either a Poisson or Negative-Binomial model

Question 2 [4 marks]

Normalisation in RNA-seq can be performed using offsets in the model fitting stages, with two of the most common approaches being TMM and CQN. Briefly describe which technical issues the two methods are attempting to overcome. Whilst there is much mathematical detail in these papers which is beyond the scope of this course, the papers describing the two approaches are available here for TMM and here for CQN. Feel free to use these papers whilst formulating your answer.

TMM normalisation helps to deal with variable library sizes. By trimming data, this also helps account for libraries which are dominated by one or two highly-expressed genes
CQN normalisation not only attempts to deal with the same issues as above, but also helps to account for any sample-specific biases due to GC-content or gene length.

Question 3 [8 marks]

source("https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/A4Funs.R")
makeGsTestData("a1234567")

For this question, please run the above code to generate your test dataset, using your own student ID instead of the one provided. This will place an object called gsTestData in your R Environment. Once you have this object, please perform the following tasks

Manipulate the data to be a \(2\times2\) matrix
Print the matrix in your RMarkdown using pander(). Missing column names in the first printed column are expected
Conduct Fisher’s Exact Test on this matrix using fisher.test() and provide the output in your RMarkdown
Interpret the results including a description of \(H_0\) and your final conclusion regarding an enrichment of the gene-set in your DE genes.

library(tidyverse)
library(pander)
myMat <- gsTestData %>%
  pivot_wider(
    id_cols = inGeneSet,
    names_from = DE,
    values_from = Total,
    names_prefix = "DE_"
  ) %>%
  as.data.frame() %>%
  column_to_rownames("inGeneSet") %>%
  as.matrix()
myMat

##       DE_TRUE DE_FALSE
## TRUE       73      435
## FALSE     517     3056

pander(myMat)

	DE_TRUE	DE_FALSE
TRUE	73	435
FALSE	517	3056

fisher.test(myMat)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  myMat
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7502549 1.2977600
## sample estimates:
## odds ratio 
##  0.9919664

The above tests the hypothesis:
\(H_0\): There is no association between DE status and the Gene Set, vs
\(H_A\): There is an association between DE status and the Gene Set

As \(p \approx 1\) we would accept the null hypothesis and conclude that there is no association between this gene set and our set of DE genes.

Question 4 [15 marks]

For this question you will need the dataset provided for you here, which is a comparison of splenic and pancreatic Tregs from Mus musculus.

Load this data into R as an object called topTable.
Decide on which genes to consider as differentially expressed (DE). You may choose any suitable criteria such as an FDR, or an FDR in combination with filtering based on logFC or simply choose the top n-ranked genes by p-value. Explain your reasoning.
Assess your DE genes for GC and Length bias as outlined in the goseq package. If any discernible pattern is present choose one of these as the bias you are choosing to offset and explain why.
Conduct an enrichment analysis using goseq() choosing one of the collections available in the msigdbr package. Justify any filtering of gene-sets that you undertake, or why you have chosen not to. A list of possible collections is provided on the help page for msigdbr. Please note that is using the categories C2, C3 or C5 you will be expected to choose only one of the subcategories. A full list with complete descriptions is available here
Present your results as a table of the most highly-ranked genesets, and if possible provide the top n-ranked genes within each geneset as a separate column. Summarise your results very briefly in plain text.

If working with others in a group, please ensure that you choose different gene-set collections for your submission.

library(goseq)
library(msigdbr)
library(magrittr)
topTable <- "https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/data/topTable.csv" %>%
  url() %>%
  read_csv()
head(topTable)

## # A tibble: 6 x 9
##   gene_id         gene_name logFC logCPM   PValue      FDR entrezid aveLen aveGc
##   <chr>           <chr>     <dbl>  <dbl>    <dbl>    <dbl>    <dbl>  <dbl> <dbl>
## 1 ENSMUSG0000002… Cela1      7.16   9.02 3.17e-69 3.36e-65   109901   639.  54.0
## 2 ENSMUSG0000003… Tent5a     1.30   7.06 1.08e-64 5.74e-61   212943  5564.  43.2
## 3 ENSMUSG0000004… Ctla2a     2.67   5.06 6.42e-55 2.27e-51    13024   792.  45.1
## 4 ENSMUSG0000002… Fosl2      1.86   5.12 2.06e-52 5.47e-49    14284  3905.  53.6
## 5 ENSMUSG0000002… Ttc39c     1.22   5.15 1.23e-34 2.61e-31    72747  1977.  49.9
## 6 ENSMUSG0000004… Camk2n1    1.98   4.86 1.20e-31 2.13e-28    66259  4444   53.7

minLFC <- 0.5
fdr <- 0.01

In order to decide on which genes were DE, I first decided to make a volcano plot. After a few experiments, I decided on an FDR of 0.01 and a logFC filter of 0.5 for considering a gene as DE. This seemed a fair compromise between accepting genes with a small amount of differential expression and being statistically conservative.

topTable$DE <- topTable$FDR < fdr & abs(topTable$logFC) > minLFC
topTable %>%
  ggplot(aes(logFC, -log10(PValue), colour = DE)) +
  geom_point() +
  geom_vline(xintercept = c(-1, 1)* minLFC, linetype = 2) +
  scale_colour_manual(values = c("grey", "red")) +
  scale_x_continuous(breaks = seq(-4, 8, by = 2)) +
  theme_bw()

This gave a total of 639 DE genes.

deVec <- setNames(topTable$DE, topTable$gene_id)
head(deVec)

## ENSMUSG00000023031 ENSMUSG00000032265 ENSMUSG00000044258 ENSMUSG00000029135 
##               TRUE               TRUE               TRUE               TRUE 
## ENSMUSG00000024424 ENSMUSG00000046447 
##               TRUE               TRUE

pwfLen <- nullp(deVec, bias.data = topTable$aveLen, plot.fit = FALSE)
pwfGC <- nullp(deVec, bias.data = topTable$aveGc, plot.fit = FALSE)
par(mfrow = c(1, 2))
plotPWF(pwfLen, xlab = "Gene Length")
plotPWF(pwfGC, xlab = "GC Content")

Comparison of DE bias for genes using both a) gene length, and b) GC content. Both showed a little bias, however gene-length was chosen as the bias for the Wallenius Approximation in goseq()

par(mfrow = c(1, 1))

For enrichment analysis, I’m going to choose the KEGG gene-sets

kg <- msigdbr(species = "Mus musculus", category = "C2", subcategory = "CP:KEGG") %>%
  inner_join(
    dplyr::select(topTable, gene_id, entrez_gene = entrezid)
  )
kgByGene <- kg %>%
  dplyr::distinct(gs_name, gene_id) %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
head(kgByGene)

## $ENSMUSG00000000001
## [1] "KEGG_AXON_GUIDANCE"                          
## [2] "KEGG_CHEMOKINE_SIGNALING_PATHWAY"            
## [3] "KEGG_GAP_JUNCTION"                           
## [4] "KEGG_LEUKOCYTE_TRANSENDOTHELIAL_MIGRATION"   
## [5] "KEGG_LONG_TERM_DEPRESSION"                   
## [6] "KEGG_MELANOGENESIS"                          
## [7] "KEGG_PROGESTERONE_MEDIATED_OOCYTE_MATURATION"
## [8] "KEGG_TIGHT_JUNCTION"                         
## 
## $ENSMUSG00000000028
## [1] "KEGG_CELL_CYCLE"
## 
## $ENSMUSG00000000088
## [1] "KEGG_ALZHEIMERS_DISEASE"         "KEGG_CARDIAC_MUSCLE_CONTRACTION"
## [3] "KEGG_HUNTINGTONS_DISEASE"        "KEGG_OXIDATIVE_PHOSPHORYLATION" 
## [5] "KEGG_PARKINSONS_DISEASE"        
## 
## $ENSMUSG00000000142
## [1] "KEGG_BASAL_CELL_CARCINOMA"  "KEGG_COLORECTAL_CANCER"    
## [3] "KEGG_ENDOMETRIAL_CANCER"    "KEGG_PATHWAYS_IN_CANCER"   
## [5] "KEGG_WNT_SIGNALING_PATHWAY"
## 
## $ENSMUSG00000000149
## [1] "KEGG_LONG_TERM_DEPRESSION"              
## [2] "KEGG_MAPK_SIGNALING_PATHWAY"            
## [3] "KEGG_REGULATION_OF_ACTIN_CYTOSKELETON"  
## [4] "KEGG_VASCULAR_SMOOTH_MUSCLE_CONTRACTION"
## 
## $ENSMUSG00000000168
## [1] "KEGG_CITRATE_CYCLE_TCA_CYCLE"    "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"
## [3] "KEGG_PYRUVATE_METABOLISM"

goseqRes <- goseq(pwfLen, gene2cat = kgByGene) %>%
  as_tibble() %>%
  mutate(
    Expected = round(sum(topTable$DE)*numInCat / length(kgByGene), 1),
    FDR = p.adjust(over_represented_pvalue, "fdr")
  ) %>%
  dplyr::select(
    category,
    Expected,
    nDE = numDEInCat, 
    nGenes = numInCat, 
    p = over_represented_pvalue,
    FDR
  ) 
goseqRes %>%
  dplyr::filter(FDR < 0.01) %>%
  left_join(kg, by = c("category" = "gs_name")) %>%
  dplyr::select(
    any_of(colnames(goseqRes)), gene_symbol, gene_id
  ) %>%
  dplyr::filter(gene_id %in% names(which(deVec))) %>%
  dplyr::select(-gene_id) %>%
  chop(gene_symbol) %>%
  mutate(
    gene_symbol = vapply(gene_symbol, paste, character(1), collapse = "; "),
    category = str_remove_all(category, "KEGG_"),
    category = str_replace_all(category, "_", " "),
    category = str_to_title(category),
    p = sprintf("%.2e", p),
    FDR = sprintf("%.2e", FDR)
  ) %>%
  dplyr::rename(
    Category = category,
    Genes = gene_symbol
  ) %>%
  pander(
    justify = "lrrrrrl",
    split.tables = Inf,
    split.cells = 30,
    caption = "Enriched KEGG pathways to an FDR of 0.01, with DE genes from each pathway shown in the final column"
  )

Enriched KEGG pathways to an FDR of 0.01, with DE genes from each pathway shown in the final column
Category	Expected	nDE	nGenes	p	FDR	Genes
Cytokine Cytokine Receptor Interaction	19.1	24	85	1.67e-09	3.09e-07	Acvr2a; Ccl5; Ccr2; Ccr3; Ccr4; Ccr5; Ccr7; Ccr9; Csf2ra; Cxcr4; Fasl; Ifngr1; Il10; Il17rb; Il18rap; Il2rb; Il3ra; Il6ra; Il9r; Tgfb3; Tnfrsf10b; Tnfrsf10b; Tnfrsf10b; Tnfrsf10b; Tnfsf11; Tnfsf14; Tnfsf8
Hematopoietic Cell Lineage	7.9	15	35	3.94e-09	3.64e-07	Cd44; Cd7; Cd8a; Cd8b1; Cd9; Csf2ra; Il3ra; Il6ra; Il9r; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3
Arrhythmogenic Right Ventricular Cardiomyopathy Arvc	7.2	14	32	1.76e-08	1.08e-06	Actn1; Cacnb3; Cacnb4; Ctnna1; Dsg2; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lmna; Tcf7
Ecm Receptor Interaction	6.7	13	30	9.33e-08	4.31e-06	Cd44; Col1a1; Col1a2; Col6a2; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lama5; Sdc4
Neuroactive Ligand Receptor Interaction	9	13	40	3.42e-06	1.27e-04	Chrnb1; Chrnb2; Cysltr1; Cysltr2; Glp1r; Gpr83; Ltb4r1; Nr3c1; P2ry14; Ptafr; Ptgir; Tbxa2r; Vipr1
Dilated Cardiomyopathy	8.7	12	39	1.65e-05	5.08e-04	Adcy7; Cacnb3; Cacnb4; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lmna; Tgfb3
Hypertrophic Cardiomyopathy Hcm	8.3	11	37	4.46e-05	1.18e-03	Cacnb3; Cacnb4; Itga1; Itga2; Itga3; Itga4; Itga6; Itgb3; Itgb5; Lmna; Tgfb3

Total: 34 marks

Summary of grades for assessment 4 2020, scaled to be out of a possible 10 marks
highest	median
9.6	5.6

Transcriptomics Applications

Assignment 4 Solutions