Instructions

Submission Format [3 marks]

This is assignment is due by 5pm, Thursday 9^th April.

Submissions must be made as a zip archive containing 2 files:
1. Your source R Markdown Document (with an Rmd suffix)
2. A compiled pdf, showing all code
All file names within the zip archive must start with your student number. However the name of the zip archive is not important as myUni will likely modify this during submission. See here for help creating a zip archive

All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.

Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.

We strongly advise working in the folder ~/transcriptomics/assignment2 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

Select both files using the Files pane in R Studio
Click export
They will automatically be placed into a single zip archive. Please name this in whatever informative name you decide is suitable, but it should contain the suffix .zip

Windows

If all files are on your on your local Windows machine:

Using File Explorer, enter the folder containing both files
Select all files simultaneously by using Ctrl + Click
Right click on one of the files and select Send to > Compressed (zipped) folder
Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your *local macOS machine`:

Locate the items to zip in the Mac Finder (file system)
Right-click on a file, folder, or files you want to zip
Select “Compress Items”
Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [3 marks]

Many early technologies still have a relevant place in modern transcriptomics. Of the technologies covered in Lecture 2: Early Transcriptomic Strategies, which technology might be suitable for analysis of pri-miRNAs? Explain why in one or two brief sentences.

CAGE is actively used in study of miRNA as it is able to distinguish the multiple pri-miRNAs which lead to identical miRNA
This takes advantage of the 5’ cap and can uniquely identify the locus of origin

Question 2 [8 marks]

Briefly contrast two of the technologies presented in Lecture 2: Early Transcriptomic Strategies and Lecture 3: Microarrays. Discuss their strengths and limitations, paying particular attention to how each represented a breakthrough at the time they gained prominence.

Some possible discussion points are

Northern blot:
- Limited sequences are able to be probed for
- Relative Quantiation is crude
- Alternate isoforms can be identified
ESTs:
- The sequences of expressed transcript were able to be determined for the first time using a high-throughput approach
- No accurate quantification was able to be performed and throughput was still limited to quantities which were feasible for Sanger-sequencing
CAGE/SAGE:
- Process is very complicated
- Relatively accurate quantitation (compared to microarrays)
- Sequences do not need to be known a priori
RT-qPCR:
- Used for single genes or targeted sequences \(\implies\) low throughput
- Accurate quantitation
Microarrays:
- Expression estimates and the relative dynamics were now able to be quantified in the thousands
- Genes were limited to those identified at design time
- The dynamic range was restricted by the limits of the scanner
- LImited ability to idenfity multiple isoforms
- Background noise was problematic

Question 3 [3 marks]

When performing a statistical test, we usually frame our analysis in terms of a Null Hypothesis and an Alternate Hypothesis, eventually returning a \(p\)-value. Explain why we do this, including clear description of what a \(p\)-value represents?

The Null Hypothesis (\(H_0\)) gives us a chance to describe what the data should like like if nothing is happening. By implication anything that is observed, but that appears unlikely under \(H_0\) means that something is happening.
A p-value represents the probability of observing data as (or more) extreme than we have observed if the Null Hypothesis were true

Question 4 [3 marks]

When conducting a \(T\)-test, we estimate two population-level parameters. Describe both of these, using the context of comparing gene expression patterns across two treatment groups.

The population mean \(\mu\) and the population variance \(\sigma^2\)
In the context of gene expression:
- \(\mu\) is the average expression level across all possible individuals
- \(\sigma\) is the variability in expression level across all possible individuals

Question 5 [10 marks]

To obtain your own set of qPCR data, please execute the following lines of code, using your own student number instead of the example given ("a1234567"). This will create an object called qPCR in your R Environment. This object will contain \(C_t\) values for a gene of interest and a housekeeper gene, across two cell types. Each experiment is run as a series of pairs so that you will have 4 values for each pair (2X Cell Types + 2X Genes).

source("https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/A2Funs.R")
makeRT("a1234567")

For this question, please perform the following tasks, showing all code. Where suitable, use pander() to display the results.

Generate a clearly labelled boxplot for each gene and cell type.
After generation of the boxplot, comment on the suitability of the housekeeper, paying particular note to the stability of it’s expression between your cell types.
Calculate the \(\Delta C_t\) values for your gene of interest by normalising to the housekeeper
Calculate the \(\Delta \Delta C_t\) values for the comparison between cell types
Using the above \(\Delta \Delta C_t\) values, perform a \(T\)-test using the R function t.test() and interpret the output

qPCR

## # A tibble: 16 x 5
##    cellType  pair gene     Ct type            
##    <fct>    <int> <fct> <dbl> <fct>           
##  1 Th           1 RPL19  15.5 Housekeeper Gene
##  2 Th           1 FOXO1  12.9 Study Gene      
##  3 Th           2 RPL19  14.6 Housekeeper Gene
##  4 Th           2 FOXO1  12.1 Study Gene      
##  5 Th           3 RPL19  15.0 Housekeeper Gene
##  6 Th           3 FOXO1  13.5 Study Gene      
##  7 Th           4 RPL19  15.0 Housekeeper Gene
##  8 Th           4 FOXO1  14.2 Study Gene      
##  9 Treg         1 RPL19  15.6 Housekeeper Gene
## 10 Treg         1 FOXO1  15.4 Study Gene      
## 11 Treg         2 RPL19  15.3 Housekeeper Gene
## 12 Treg         2 FOXO1  13.1 Study Gene      
## 13 Treg         3 RPL19  14.7 Housekeeper Gene
## 14 Treg         3 FOXO1  13.9 Study Gene      
## 15 Treg         4 RPL19  14.9 Housekeeper Gene
## 16 Treg         4 FOXO1  14.2 Study Gene

qPCR %>%
  ggplot(aes(gene, Ct, fill = gene)) +
  geom_boxplot() +
  facet_wrap(~cellType) +
  theme_bw()

For this dataset, the expression of the housekeeper RPL19 appears relatively stable across both cell types. This may not have been the case for some students’ data

Calculation of dCT using pivot_wider() to rearrange columns. There are numerous alternative approaches, but this was the one I chose.

dCT <- qPCR %>%
  pivot_wider(
    id_cols = c(cellType, pair),
    names_from = gene,
    values_from = Ct
  ) %>%
  mutate(
    dCt = FOXO1 - RPL19
  )
dCT

## # A tibble: 8 x 5
##   cellType  pair RPL19 FOXO1    dCt
##   <fct>    <int> <dbl> <dbl>  <dbl>
## 1 Th           1  15.5  12.9 -2.65 
## 2 Th           2  14.6  12.1 -2.53 
## 3 Th           3  15.0  13.5 -1.51 
## 4 Th           4  15.0  14.2 -0.826
## 5 Treg         1  15.6  15.4 -0.189
## 6 Treg         2  15.3  13.1 -2.22 
## 7 Treg         3  14.7  13.9 -0.886
## 8 Treg         4  14.9  14.2 -0.709

Calculation of ddCT using a similar strategy

ddCT <- dCT %>%
  pivot_wider(
    id_cols = pair,
    names_from = cellType,
    names_prefix = "dCt_",
    values_from = dCt
  ) %>%
  mutate(
    ddCT = dCt_Th - dCt_Treg
  )
ddCT

## # A tibble: 4 x 4
##    pair dCt_Th dCt_Treg   ddCT
##   <int>  <dbl>    <dbl>  <dbl>
## 1     1 -2.65    -0.189 -2.46 
## 2     2 -2.53    -2.22  -0.308
## 3     3 -1.51    -0.886 -0.624
## 4     4 -0.826   -0.709 -0.116

To perform the t-test

t.test(ddCT$ddCT)

## 
##  One Sample t-test
## 
## data:  ddCT$ddCT
## t = -1.6308, df = 3, p-value = 0.2014
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -2.5864868  0.8338008
## sample estimates:
## mean of x 
## -0.876343

For \(H_0:\) true average change in FOXO1 expression is zero between Th and Treg, the t-test produced a p-value of 0.2014. As this is \(>0.05\) we would accept \(H_0\) and conclude that the true average difference in expression is zero.

Question 6 [10 marks]

For this question, you will need the objects cpm.tsv, topTable.csv and de.tsv. You will be assigned a gene set using the following command, again remembering to use your own student ID number.

source("https://uofabioinformaticshub.github.io/transcriptomics_applications/assignments/A2Funs.R")
chooseGeneSet("a1234567")

For this question, your task is to:

Find which differentially expressed genes belong to this gene-set. These are provided in the object de.tsv, and these should be formed into a character vector.
Restrict this character vector so that it only contains genes within cpm.tsv. (Hint: You can use topTable.csv to map from gene names to gene IDs)
Using pheatmap(), create a heatmap of these genes using the expression values contained in cpm.tsv. For reference, these values are provided as logCPM values, which are suitable for plotting directly. Include an annotation for each sample, indicating which genotype it represents.

chooseGeneSet("a1234567")

## [1] "GO_RESPONSE_TO_CYTOKINE"

First we need to find which DE genes belong to this geneset

de <- here::here("practicals/data/de.tsv") %>%
  read_tsv() %>%
  dplyr::filter(gs_name == "GO_RESPONSE_TO_CYTOKINE") 
deGenes <- str_split(de$DE, pattern = ";")[[1]]
deGenes

##  [1] "creb1b"   "rpl3"     "otud4"    "tuba1c"   "srm"      "rps16"   
##  [7] "uba52"    "psmd14"   "ccl25b"   "cd40"     "gbp1"     "mif"     
## [13] "sharpin"  "ube2kb"   "mt2"      "mmp2"     "twistnb"  "rplp0"   
## [19] "ptpn12"   "stat5a"   "ddost"    "slc25a5"  "irf1b"    "jagn1a"  
## [25] "hcls1"    "vrk2"     "selplg"   "calcoco2" "psmb10"   "tfr1a"   
## [31] "pkz"      "cmklr1"

Given that the object cpm.tsv and topTable.csv contain only a subset of genes, we can subset our vector of DE genes.

topTable <-  here::here("practicals/data/topTable.csv") %>%
  read_csv()
myDeGenes <- deGenes[deGenes %in% topTable$gene_name]
myDeGenes

## [1] "uba52"   "gbp1"    "sharpin" "ube2kb"  "tfr1a"

Now we can use these to create a heatmap

logCPM <-  here::here("practicals/data/cpm.tsv") %>%
  read_tsv()

library(pheatmap)
library(scales)
topTable %>%
  dplyr::filter(gene_name %in% myDeGenes) %>%
  dplyr::select(gene_id, gene_name) %>%
  left_join(logCPM) %>%
  dplyr::select(gene_name, starts_with("Ps")) %>%
  rename_all(
    str_replace, 
    pattern = ".+(Het|WT).+_F3_([0-9]+).+", 
    replacement = "\\1_\\2"
  ) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  pheatmap(
    color = viridis_pal(option = "magma")(100),
    cutree_cols = 2,
    cutree_rows = 2,
    angle_col = 0
  )

Total: 37 marks

Summary of grades for assessment 2 2020, out of a possible 10 marks
highest	median
9.2	6.7

Transcriptomics Applications

Assignment 2