Instructions

Submission Format [3 marks]

This is assignment is due by 5pm, Tuesday 9th June.

  • Submissions must be made as a zip archive containing 2 files:
    1. Your source R Markdown Document (with an Rmd suffix)
    2. A compiled pdf, showing all code
  • All file names within the zip archive must start with your student number. However the name of the zip archive is not important as myUni will likely modify this during submission. See here for help creating a zip archive

All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.

Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.

We strongly advise working in the folder ~/transcriptomics/assignment4 on your virtual machine. Using an R Project for each individual assignment is also strongly advised.

Creating a zip archive

On Your VM

If all files required for submission are contained on your VM:

  1. Select both files using the Files pane in R Studio
  2. Click export
  3. They will automatically be placed into a single zip archive. Please name this in whatever informative name you decide is suitable, but it should contain the suffix .zip

Windows

If all files are on your on your local Windows machine:

  1. Using File Explorer, enter the folder containing both files
  2. Select all files simultaneously by using Ctrl + Click
  3. Right click on one of the files and select Send to > Compressed (zipped) folder
  4. Rename as appropriate, ensuring the archive ends with the suffix .zip

Mac OS

If all the files are on your *local macOS machine`:

  1. Locate the items to zip in the Mac Finder (file system)
  2. Right-click on a file, folder, or files you want to zip
  3. Select “Compress Items”
  4. Find the newly created .zip archive in the same directory and name as appropriate

Questions

Question 1 [5 marks]

A Module Eigengene (ME) is one of the fundamental concepts of gene co-expression network analysis using WGCNA.

  1. Is ME one of the genes in a co-expression module? Explain your reasoning
  2. Describe the roles (usage) of ME in gene co-expression network analysis
  1. No, the module eigengene is a combined representation of all genes in the module
  2. The module eigengene has two primary functions, 1) to assess the relationship between modules and 2) to correlate the modules with clinical/phenotypic data

Question 2 [4 marks]

We have discussed that the key point of WGCNA analysis is to discover biological significance. Describe how can we build the link between our co-expression network analysis and biological significance.

  • We are attempting to find the the co-expression modules which are highly correlated with phenotypic status of our samples
  • We can use enrichment analysis to unearth the biology related to each specific module
  • We can broadly define the gene expression patterns and how they relate to phenotypic conditions under investigation

Question 3 [4 marks]

Bulk RNA-seq and scRNA-seq differ in many ways. Describe two common aspects shared between the two approaches, and provide details about two key differences between the two.

  • We are measuring abundances in both approaches using read counts aligned to each gene to represent this
  • Both approaches commonly use cDNA & reverse transcriptase to amplify the template RNA transcripts
  • We commonly measure reads across the length of the transcript for bulk-RNA< but this is not always the case in scRNA-Seq
  • We are able to assess cellular heterogeneity using scRNA-Seq, whilst this is not the primary purpose of bulk-RNA
  • Lack of detection for a gene within a cell is not evidence of lack of expression, whilst for bulk samples this can be assumed

Question 4 [12 marks]

Use the given gene expression dataset and corresponding clinical data to identify co-expression module(s) with the strongest correlation with clinical data. The gene expression dataset includes raw counts for human reference genes across 37 samples (skin cancer patients), and the clinical data includes diagnosed cancer stage status of 37 patients. The results should minimally include:

  1. A figure showing which and how the power beta is selected.
  2. A gene dendrogram and coloured co-expression modules detection using dynamic tree cut
library(WGCNA)
library(limma)
library(ape)
library(matrixStats)
library(tidyverse)
library(cowplot)
theme_set(theme_bw())

First we need to load the data and transform using voom so we get manageable CPM values

counts <- read_tsv("data/WGCNA_rawCounts.tsv") %>%
  as.data.frame() %>%
  column_to_rownames("geneID") %>%
  as.matrix() %>%
  .[rowSums(. > 0) > 0.8*ncol(.),]
voomCounts <- voom(counts)$E

To find the most variables genes we can use the MAD (median absolute deviation). Let’s select the most variable 5000, then check for outliers.

mads <- rowMads(voomCounts)
madCutoff <- quantile(mads, probs = 1 - 5000/length(mads))
hist(mads, breaks = 100)
abline(v = madCutoff, col = "red")

mat <- t(voomCounts[mads > madCutoff,])
mat %>%
  dist %>%
  hclust(method = "average") %>%
  plot(
    main = "Sample clustering to detect outliers", cex.main = 1.5,
    sub = "",
    xlab = ""
  )

Next we need to find our soft-thresholding power

powers <- c(c(1:10), seq(from = 12, to = 20, by = 2))
sft <- pickSoftThreshold(mat, powerVector = powers, verbose = 0)
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1   0.0284  0.901          0.963 909.000  9.01e+02 1280.0
## 2      2   0.1560 -0.944          0.832 263.000  2.48e+02  547.0
## 3      3   0.7550 -1.870          0.851  98.400  8.47e+01  332.0
## 4      4   0.9220 -1.970          0.916  44.500  3.30e+01  237.0
## 5      5   0.9550 -1.850          0.943  23.200  1.42e+01  185.0
## 6      6   0.9450 -1.720          0.930  13.600  6.58e+00  152.0
## 7      7   0.9340 -1.610          0.920   8.680  3.25e+00  128.0
## 8      8   0.9140 -1.540          0.906   5.950  1.68e+00  110.0
## 9      9   0.9210 -1.460          0.916   4.300  9.13e-01   95.7
## 10    10   0.9320 -1.400          0.929   3.240  5.06e-01   84.1
## 11    12   0.9450 -1.310          0.949   2.010  1.70e-01   66.3
## 12    14   0.9270 -1.280          0.933   1.360  6.34e-02   53.5
## 13    16   0.9310 -1.250          0.929   0.965  2.54e-02   43.9
## 14    18   0.9620 -1.200          0.961   0.715  1.06e-02   36.4
## 15    20   0.9640 -1.170          0.959   0.546  4.61e-03   30.6
a <- sft$fitIndices %>%
  mutate(y = -sign(slope)*SFT.R.sq) %>%
  ggplot(aes(Power, y, label = Power)) +
  geom_text(colour = "red") +
  geom_hline(yintercept = 0.9, colour = "red") +
  labs(
    x = "Soft Threshold (power)", 
    y = "Scale Free Topology Model Fit,signed R^2"
  )
b <- sft$fitIndices %>%
  ggplot(aes(Power, mean.k., label= Power)) +
  geom_text(colour = "red") +
  labs(
    x = "Soft Threshold (power)", 
    y = "Mean Connectivity"
  )
plot_grid(a, b, nrow = 1)

Let’s choose the power \(\beta = 4\) as that is where the signed \(R^2\) value is large enough, but significant connectivity is still retained. Then we’ll obtain our adjacency matrix & dissimilarity.

adjMat <- adjacency(mat, type = "signed", power = 4)
TOM <- TOMsimilarity(adjMat)
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
dissTOM <- 1 - TOM

Now we can detect modules

geneTree <- dissTOM %>%
  as.dist %>%
  hclust(method = "average")
plot(
  geneTree, 
  xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity",
  labels = FALSE, 
  hang = 0.04
  )

Module identification using dynamic tree cut

minModuleSize <- 30
dynamicMods <- cutreeDynamic(
  dendro = geneTree, 
  distM = dissTOM, 
  deepSplit = 4,
  pamRespectsDendro = FALSE,
  minClusterSize = minModuleSize
)
##  ..cutHeight not given, setting it to 0.907  ===>  99% of the (truncated) height range in dendro.
##  ..done.
table(dynamicMods)
## dynamicMods
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 1776  401  398  350  333  198  179  150  133  129  109   78   73   72   67   62 
##   17   18   19   20   21   22   23   24   25   26 
##   61   59   58   55   54   53   44   39   38   31

Assign module colours and plot the dendrogram and corresponding colour bars underneath

dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)
## dynamicColors
##         black          blue         brown          cyan     darkgreen 
##           179           401           398            72            53 
##      darkgrey    darkorange       darkred darkturquoise         green 
##            39            31            54            44           333 
##   greenyellow        grey60     lightcyan    lightgreen   lightyellow 
##           109            61            62            59            58 
##       magenta  midnightblue        orange          pink        purple 
##           133            67            38           150           129 
##           red     royalblue        salmon           tan     turquoise 
##           198            55            73            78          1776 
##        yellow 
##           350
plotDendroAndColors(
  dendro = geneTree, 
  colors = dynamicColors, 
  groupLabels = "Dynamic Tree Cut",
  dendroLabels = FALSE, 
  hang = 0.03,
  addGuide = TRUE, 
  guideHang = 0.05, 
  main = ""
  )

Find the eigengenes and check how similar they are to each other

MEList <- moduleEigengenes(mat, colors = dynamicColors)
MEs <- MEList$eigengenes
MEDiss <- 1 - cor(MEs)
par(
  mfrow = c(1, 1),
  mar = c(2,2,2,2)
)
MEDiss %>%
  as.dist() %>%
  hclust(method = "average") %>%
  as.phylo() %>%
  plot.phylo(
    type = "fan",
    tip.color = dynamicColors %>%
      as.factor() %>%
      levels(), 
    label.offset = 0.06,
    main = "Clustering of module eigengenes"
  )
tiplabels(
  frame = "circle", 
  col = "black",
  text = rep("", length(unique(dynamicMods))),
  bg = levels(as.factor(dynamicColors))
  )

Our final step is to associate each module with the clinical traits

traits <- read_tsv("data/WGCNA_clinical_data.tsv") %>%
  as.data.frame() %>%
  column_to_rownames("sampleID") %>%
  as.matrix() %>%
  .[rownames(mat),] %>%
  as.factor()
meCors <- MEs %>%
  cor(
    as.integer(traits),
    use = "p"
  ) %>%
  as.data.frame() %>%
  setNames("Correlation") %>%
  rownames_to_column("ME") %>%
  as_tibble() %>%
  mutate(
    p = corPvalueStudent(Correlation, nSamples = nrow(mat))
  ) %>%
  arrange(p)
head(meCors)
## # A tibble: 6 x 3
##   ME            Correlation     p
##   <chr>               <dbl> <dbl>
## 1 MEdarkgrey         -0.263 0.121
## 2 MEturquoise        -0.221 0.195
## 3 MEblack            -0.210 0.218
## 4 MEgrey60           -0.209 0.221
## 5 MEorange           -0.191 0.265
## 6 MElightyellow      -0.171 0.318

It appears that for my parameters no modules show any compelling correlations. For my analysis, darkgrey appears to most correlated with the tumour stage, so let’s check the eigengene and overall gene expression patterns. To be honest, they don’t look really that exciting.

modName <- "darkgrey"
meName <- paste("ME", modName, sep = "")
a <- mat[, dynamicColors == modName] %>%
  scale() %>%
  as.data.frame() %>%
  rownames_to_column("sample") %>%
  pivot_longer(
    cols = -sample,
    values_to = "scaledExp",
    names_to = "gene"
  ) %>%
  mutate(
    Stage = traits[sample]
  ) %>%
  ggplot(
    aes(sample, gene, fill = scaledExp)
  ) +
  geom_raster() +
  scale_fill_gradient2(low = "green", mid = "black", high = "red") +
  facet_grid(.~Stage, scales = "free_x", space = "free_x") +
  labs(y = "Gene") +
  ggtitle(paste("Expression patterns for the", modName, "module")) +
  theme(
    axis.text.y = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.title.x = element_blank(),
    legend.position = "none",
    plot.title  = element_text(hjust = 0.5)
  )
b <- MEs %>%
  as.data.frame() %>%
  rownames_to_column("sample") %>%
  dplyr::select(sample, one_of(meName)) %>%
  rename_all(
    str_replace_all, pattern = meName, replacement = "ME"
  ) %>%
  mutate(
    Stage = traits[sample]
  ) %>%
  ggplot(
    aes(sample, ME)
  ) +
  geom_col() +
  facet_grid(.~Stage, scales = "free_x", space = "free_x") +
  labs(
    x = "Sample",
    y = "ME Expression"
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks = element_blank()
  )
plot_grid(
  a, b, nrow = 2, rel_heights = c(7, 3)
)

Question 5 [12 marks]

Using the SingleCellExperiment object provided, perform the following steps. Please note that undetectable genes and low quality cells have already been removed from this dataset. Counts are already normalised and log transformed, and contain expression patterns from 622 mouse neuronal cells.

  1. Perform PCA and generate a plot of the cells using PC1 and PC2
  2. Choosing a suitable method, define clusters of cell types
  3. Visualise your clusters using only one of either a PCA, UMAP or tSNE plot
  4. Find the top 10 marker genes for one of your clusters and visualise the expression patterns
  5. Discuss your clusters and comment on how effective your approach was

Load the data, then we can define the most variable 1000 genes and see what range of expression values they come from

library(scran)
library(scater)
sce <- read_rds("data/sce.rds")
geneVars <- modelGeneVar(sce)
hvg <- getTopHVGs(geneVars, n = 1000)
geneVars %>%
  metadata() %>%
  with(
    tibble(gene = names(mean), mean, var, trend = trend(mean))
  ) %>%
  mutate(hvg = gene %in% hvg) %>%
  ggplot(aes(mean, var, colour = hvg)) +
  geom_point() +
  geom_line(aes(y = trend), colour = "blue") +
  scale_colour_manual(values = c("black", "red"))

Use the highly-variable genes to perform a PCA, and decide how many PCs to keep. I’ve chosen 8, but any number around 7-8 would be acceptable. Now we can plot the data, before clustering just to see how similar our cells are.

sce <- runPCA(sce, subset_row = hvg)
sce %>%
  reducedDim("PCA") %>%
  attr("percentVar") %>%
  enframe(value = "Var (%)", name = c()) %>%
  mutate(PC = seq_along(`Var (%)`)) %>%
  ggplot(aes(PC, `Var (%)`)) +
  geom_point()

reducedDim(sce, "PCA") <- reducedDim(sce, "PCA")[,1:8]
plotReducedDim(sce, dimred = "PCA")

Using a shared-neighbour graph, let’s form clusters and see where they are on our PCA plot

g <- buildSNNGraph(sce, k = 15, use.dimred = "PCA")
sce$snnCluster <- igraph::cluster_walktrap(g)$membership %>%
  as.factor()
table(sce$snnCluster)
## 
##   1   2   3   4   5   6 
## 104 151 161  86  77  43
plotReducedDim(sce, dimred = "PCA", colour_by = "snnCluster", text_by = "snnCluster") 

The clusters don’t appear well defined using PCA, so for this dataset a TSNE looks to be a better visualisation.

sce <- runTSNE(sce, dimred = "PCA")
plotReducedDim(sce, dimred = "TSNE", colour_by = "snnCluster", text_by = "snnCluster") 

Find the marker genes. For this approach, we’ll choose the genes from each cluster which are always more highly expressed in comparison to other clusters

markersAll <- findMarkers(sce, groups = sce$snnCluster, direction = "up", pval.type = "all")
markersAll %>%
  lapply(subset, FDR < 0.05) %>% 
  sapply(nrow)
##    1    2    3    4    5    6 
##   17 1567  178    4    9  181

Let’s explore cluster 1.

markersAll$`1`[1:10,]
## DataFrame with 10 rows and 7 columns
##                     p.value                  FDR           logFC.2
##                   <numeric>            <numeric>         <numeric>
## Tppp3  2.95591742398886e-16 3.75933577982903e-12  1.88919801824908
## Rplp2  4.59264058735744e-10  2.9204601495006e-06  2.23708577027848
## Rps6   2.72112236247504e-09 1.15357447353192e-05  1.79411270838298
## Pmm1   3.66325296116236e-09 1.16473127900157e-05  1.96150262228162
## Fau    5.10799806934531e-08 0.000129927038891867  1.22764418594497
## Cisd1  1.64253268279868e-07 0.000348162177663893  1.50406015400436
## Rps3a   3.8972211271847e-07  0.00063551233044104  1.30589580008479
## Rgs10  3.99756144325234e-07  0.00063551233044104  5.93927604478946
## Tpt1   7.13030850567548e-07  0.00100759181750201 0.911288819082753
## Gm4340 1.14356542797001e-06  0.00145438651129226  1.16878171873654
##                  logFC.3           logFC.4          logFC.5           logFC.6
##                <numeric>         <numeric>        <numeric>         <numeric>
## Tppp3   1.13346077889811  1.01413629567347 2.49020656254163  1.12378593799301
## Rplp2   1.58748002156269  3.28618533851297 3.19596334642597  2.40364340640982
## Rps6    1.27906213277113   1.9609988716549 2.71967865091889  1.58206930727463
## Pmm1    2.03686222003743  2.20715635632626 4.71371612966699  1.01086320954415
## Fau    0.783889249367567 0.981981536619182 2.14123724561381  0.89489080152237
## Cisd1   1.28701037235121  1.80039905879166 3.85977617722862  1.72514540683578
## Rps3a  0.662351876056874  1.03503516597734 2.24045961891898 0.475663161589944
## Rgs10   1.54837882477117 0.518906104862294 6.07627594270224  1.68155751942136
## Tpt1    1.40997669635374  2.40366099519919 3.08097162642202  1.02548885184942
## Gm4340  1.14034610048668  1.19926542803041 1.19926542803041  1.19926542803041
gn <- markersAll$`1` %>%
  subset(FDR < 0.01) %>%
  rownames() %>%
  .[1:10]
plotExpression(sce, features = gn, x = "snnCluster", colour_by = "snnCluster")

These look really unconvincing. The issue here is the use of the argument pval.type = "all" argument. Given that cluster 1 appears closely connected to clusters 4 & 6 there may be no consistent marker which separates these. Cluster 2 however is highly distinct and should produce more compelling results from this strategy.

gn <- markersAll$`2` %>%
  subset(FDR < 0.01) %>%
  rownames() %>%
  .[1:10]
plotExpression(sce, features = gn, x = "snnCluster", colour_by = "snnCluster")

This approach does need further refinement. Whilst the clusters look OK, finding key marker genes may prove difficult. Changing to pval.type = "some" may be advantageous and allow for identification of common markers across cluster 1, 4 & 6. Re-clustering may also be viable.

plotExpression(
  sce,
  features = findMarkers(sce, groups = sce$snnCluster, direction = "up", pval.type = "some")$`1` %>%
  rownames() %>%
  .[1:10],
  x = "snnCluster",
  colour_by = "snnCluster"
  )

Total: 40 marks

Summary of grades for assessment 5 2020, scaled to be out of a possible 10 marks
highest median
9.5 7.7