This is assignment is due by 5pm, Tuesday 9th June.
All questions are to be answered on the same R Markdown / PDF, regardless of if they require a plain text answer, or require execution of code.
Marks directly correspond to the amount of time and effort we expect for each question, so please answer with this is mind.
We strongly advise working in the folder ~/transcriptomics/assignment4
on your virtual machine. Using an R Project for each individual assignment is also strongly advised.
If all files required for submission are contained on your VM:
.zip
If all files are on your on your local Windows machine:
Send to > Compressed (zipped) folder
.zip
If all the files are on your *local macOS machine`:
A Module Eigengene (ME) is one of the fundamental concepts of gene co-expression network analysis using WGCNA.
We have discussed that the key point of WGCNA analysis is to discover biological significance. Describe how can we build the link between our co-expression network analysis and biological significance.
Bulk RNA-seq and scRNA-seq differ in many ways. Describe two common aspects shared between the two approaches, and provide details about two key differences between the two.
Use the given gene expression dataset and corresponding clinical data to identify co-expression module(s) with the strongest correlation with clinical data. The gene expression dataset includes raw counts for human reference genes across 37 samples (skin cancer patients), and the clinical data includes diagnosed cancer stage status of 37 patients. The results should minimally include:
library(WGCNA)
library(limma)
library(ape)
library(matrixStats)
library(tidyverse)
library(cowplot)
theme_set(theme_bw())
First we need to load the data and transform using voom so we get manageable CPM values
counts <- read_tsv("data/WGCNA_rawCounts.tsv") %>%
as.data.frame() %>%
column_to_rownames("geneID") %>%
as.matrix() %>%
.[rowSums(. > 0) > 0.8*ncol(.),]
voomCounts <- voom(counts)$E
To find the most variables genes we can use the MAD (median absolute deviation). Let’s select the most variable 5000, then check for outliers.
mads <- rowMads(voomCounts)
madCutoff <- quantile(mads, probs = 1 - 5000/length(mads))
hist(mads, breaks = 100)
abline(v = madCutoff, col = "red")
mat <- t(voomCounts[mads > madCutoff,])
mat %>%
dist %>%
hclust(method = "average") %>%
plot(
main = "Sample clustering to detect outliers", cex.main = 1.5,
sub = "",
xlab = ""
)
Next we need to find our soft-thresholding power
powers <- c(c(1:10), seq(from = 12, to = 20, by = 2))
sft <- pickSoftThreshold(mat, powerVector = powers, verbose = 0)
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.0284 0.901 0.963 909.000 9.01e+02 1280.0
## 2 2 0.1560 -0.944 0.832 263.000 2.48e+02 547.0
## 3 3 0.7550 -1.870 0.851 98.400 8.47e+01 332.0
## 4 4 0.9220 -1.970 0.916 44.500 3.30e+01 237.0
## 5 5 0.9550 -1.850 0.943 23.200 1.42e+01 185.0
## 6 6 0.9450 -1.720 0.930 13.600 6.58e+00 152.0
## 7 7 0.9340 -1.610 0.920 8.680 3.25e+00 128.0
## 8 8 0.9140 -1.540 0.906 5.950 1.68e+00 110.0
## 9 9 0.9210 -1.460 0.916 4.300 9.13e-01 95.7
## 10 10 0.9320 -1.400 0.929 3.240 5.06e-01 84.1
## 11 12 0.9450 -1.310 0.949 2.010 1.70e-01 66.3
## 12 14 0.9270 -1.280 0.933 1.360 6.34e-02 53.5
## 13 16 0.9310 -1.250 0.929 0.965 2.54e-02 43.9
## 14 18 0.9620 -1.200 0.961 0.715 1.06e-02 36.4
## 15 20 0.9640 -1.170 0.959 0.546 4.61e-03 30.6
a <- sft$fitIndices %>%
mutate(y = -sign(slope)*SFT.R.sq) %>%
ggplot(aes(Power, y, label = Power)) +
geom_text(colour = "red") +
geom_hline(yintercept = 0.9, colour = "red") +
labs(
x = "Soft Threshold (power)",
y = "Scale Free Topology Model Fit,signed R^2"
)
b <- sft$fitIndices %>%
ggplot(aes(Power, mean.k., label= Power)) +
geom_text(colour = "red") +
labs(
x = "Soft Threshold (power)",
y = "Mean Connectivity"
)
plot_grid(a, b, nrow = 1)
Let’s choose the power \(\beta = 4\) as that is where the signed \(R^2\) value is large enough, but significant connectivity is still retained. Then we’ll obtain our adjacency matrix & dissimilarity.
adjMat <- adjacency(mat, type = "signed", power = 4)
TOM <- TOMsimilarity(adjMat)
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
dissTOM <- 1 - TOM
Now we can detect modules
geneTree <- dissTOM %>%
as.dist %>%
hclust(method = "average")
plot(
geneTree,
xlab = "", sub = "", main = "Gene clustering on TOM-based dissimilarity",
labels = FALSE,
hang = 0.04
)
Module identification using dynamic tree cut
minModuleSize <- 30
dynamicMods <- cutreeDynamic(
dendro = geneTree,
distM = dissTOM,
deepSplit = 4,
pamRespectsDendro = FALSE,
minClusterSize = minModuleSize
)
## ..cutHeight not given, setting it to 0.907 ===> 99% of the (truncated) height range in dendro.
## ..done.
table(dynamicMods)
## dynamicMods
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1776 401 398 350 333 198 179 150 133 129 109 78 73 72 67 62
## 17 18 19 20 21 22 23 24 25 26
## 61 59 58 55 54 53 44 39 38 31
Assign module colours and plot the dendrogram and corresponding colour bars underneath
dynamicColors <- labels2colors(dynamicMods)
table(dynamicColors)
## dynamicColors
## black blue brown cyan darkgreen
## 179 401 398 72 53
## darkgrey darkorange darkred darkturquoise green
## 39 31 54 44 333
## greenyellow grey60 lightcyan lightgreen lightyellow
## 109 61 62 59 58
## magenta midnightblue orange pink purple
## 133 67 38 150 129
## red royalblue salmon tan turquoise
## 198 55 73 78 1776
## yellow
## 350
plotDendroAndColors(
dendro = geneTree,
colors = dynamicColors,
groupLabels = "Dynamic Tree Cut",
dendroLabels = FALSE,
hang = 0.03,
addGuide = TRUE,
guideHang = 0.05,
main = ""
)
Find the eigengenes and check how similar they are to each other
MEList <- moduleEigengenes(mat, colors = dynamicColors)
MEs <- MEList$eigengenes
MEDiss <- 1 - cor(MEs)
par(
mfrow = c(1, 1),
mar = c(2,2,2,2)
)
MEDiss %>%
as.dist() %>%
hclust(method = "average") %>%
as.phylo() %>%
plot.phylo(
type = "fan",
tip.color = dynamicColors %>%
as.factor() %>%
levels(),
label.offset = 0.06,
main = "Clustering of module eigengenes"
)
tiplabels(
frame = "circle",
col = "black",
text = rep("", length(unique(dynamicMods))),
bg = levels(as.factor(dynamicColors))
)
Our final step is to associate each module with the clinical traits
traits <- read_tsv("data/WGCNA_clinical_data.tsv") %>%
as.data.frame() %>%
column_to_rownames("sampleID") %>%
as.matrix() %>%
.[rownames(mat),] %>%
as.factor()
meCors <- MEs %>%
cor(
as.integer(traits),
use = "p"
) %>%
as.data.frame() %>%
setNames("Correlation") %>%
rownames_to_column("ME") %>%
as_tibble() %>%
mutate(
p = corPvalueStudent(Correlation, nSamples = nrow(mat))
) %>%
arrange(p)
head(meCors)
## # A tibble: 6 x 3
## ME Correlation p
## <chr> <dbl> <dbl>
## 1 MEdarkgrey -0.263 0.121
## 2 MEturquoise -0.221 0.195
## 3 MEblack -0.210 0.218
## 4 MEgrey60 -0.209 0.221
## 5 MEorange -0.191 0.265
## 6 MElightyellow -0.171 0.318
It appears that for my parameters no modules show any compelling correlations. For my analysis, darkgrey appears to most correlated with the tumour stage, so let’s check the eigengene and overall gene expression patterns. To be honest, they don’t look really that exciting.
modName <- "darkgrey"
meName <- paste("ME", modName, sep = "")
a <- mat[, dynamicColors == modName] %>%
scale() %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
pivot_longer(
cols = -sample,
values_to = "scaledExp",
names_to = "gene"
) %>%
mutate(
Stage = traits[sample]
) %>%
ggplot(
aes(sample, gene, fill = scaledExp)
) +
geom_raster() +
scale_fill_gradient2(low = "green", mid = "black", high = "red") +
facet_grid(.~Stage, scales = "free_x", space = "free_x") +
labs(y = "Gene") +
ggtitle(paste("Expression patterns for the", modName, "module")) +
theme(
axis.text.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
legend.position = "none",
plot.title = element_text(hjust = 0.5)
)
b <- MEs %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
dplyr::select(sample, one_of(meName)) %>%
rename_all(
str_replace_all, pattern = meName, replacement = "ME"
) %>%
mutate(
Stage = traits[sample]
) %>%
ggplot(
aes(sample, ME)
) +
geom_col() +
facet_grid(.~Stage, scales = "free_x", space = "free_x") +
labs(
x = "Sample",
y = "ME Expression"
) +
theme(
axis.text.y = element_blank(),
axis.text.x = element_blank(),
axis.ticks = element_blank()
)
plot_grid(
a, b, nrow = 2, rel_heights = c(7, 3)
)
Using the SingleCellExperiment object provided, perform the following steps. Please note that undetectable genes and low quality cells have already been removed from this dataset. Counts are already normalised and log transformed, and contain expression patterns from 622 mouse neuronal cells.
Load the data, then we can define the most variable 1000 genes and see what range of expression values they come from
library(scran)
library(scater)
sce <- read_rds("data/sce.rds")
geneVars <- modelGeneVar(sce)
hvg <- getTopHVGs(geneVars, n = 1000)
geneVars %>%
metadata() %>%
with(
tibble(gene = names(mean), mean, var, trend = trend(mean))
) %>%
mutate(hvg = gene %in% hvg) %>%
ggplot(aes(mean, var, colour = hvg)) +
geom_point() +
geom_line(aes(y = trend), colour = "blue") +
scale_colour_manual(values = c("black", "red"))
Use the highly-variable genes to perform a PCA, and decide how many PCs to keep. I’ve chosen 8, but any number around 7-8 would be acceptable. Now we can plot the data, before clustering just to see how similar our cells are.
sce <- runPCA(sce, subset_row = hvg)
sce %>%
reducedDim("PCA") %>%
attr("percentVar") %>%
enframe(value = "Var (%)", name = c()) %>%
mutate(PC = seq_along(`Var (%)`)) %>%
ggplot(aes(PC, `Var (%)`)) +
geom_point()
reducedDim(sce, "PCA") <- reducedDim(sce, "PCA")[,1:8]
plotReducedDim(sce, dimred = "PCA")
Using a shared-neighbour graph, let’s form clusters and see where they are on our PCA plot
g <- buildSNNGraph(sce, k = 15, use.dimred = "PCA")
sce$snnCluster <- igraph::cluster_walktrap(g)$membership %>%
as.factor()
table(sce$snnCluster)
##
## 1 2 3 4 5 6
## 104 151 161 86 77 43
plotReducedDim(sce, dimred = "PCA", colour_by = "snnCluster", text_by = "snnCluster")
The clusters don’t appear well defined using PCA, so for this dataset a TSNE looks to be a better visualisation.
sce <- runTSNE(sce, dimred = "PCA")
plotReducedDim(sce, dimred = "TSNE", colour_by = "snnCluster", text_by = "snnCluster")
Find the marker genes. For this approach, we’ll choose the genes from each cluster which are always more highly expressed in comparison to other clusters
markersAll <- findMarkers(sce, groups = sce$snnCluster, direction = "up", pval.type = "all")
markersAll %>%
lapply(subset, FDR < 0.05) %>%
sapply(nrow)
## 1 2 3 4 5 6
## 17 1567 178 4 9 181
Let’s explore cluster 1.
markersAll$`1`[1:10,]
## DataFrame with 10 rows and 7 columns
## p.value FDR logFC.2
## <numeric> <numeric> <numeric>
## Tppp3 2.95591742398886e-16 3.75933577982903e-12 1.88919801824908
## Rplp2 4.59264058735744e-10 2.9204601495006e-06 2.23708577027848
## Rps6 2.72112236247504e-09 1.15357447353192e-05 1.79411270838298
## Pmm1 3.66325296116236e-09 1.16473127900157e-05 1.96150262228162
## Fau 5.10799806934531e-08 0.000129927038891867 1.22764418594497
## Cisd1 1.64253268279868e-07 0.000348162177663893 1.50406015400436
## Rps3a 3.8972211271847e-07 0.00063551233044104 1.30589580008479
## Rgs10 3.99756144325234e-07 0.00063551233044104 5.93927604478946
## Tpt1 7.13030850567548e-07 0.00100759181750201 0.911288819082753
## Gm4340 1.14356542797001e-06 0.00145438651129226 1.16878171873654
## logFC.3 logFC.4 logFC.5 logFC.6
## <numeric> <numeric> <numeric> <numeric>
## Tppp3 1.13346077889811 1.01413629567347 2.49020656254163 1.12378593799301
## Rplp2 1.58748002156269 3.28618533851297 3.19596334642597 2.40364340640982
## Rps6 1.27906213277113 1.9609988716549 2.71967865091889 1.58206930727463
## Pmm1 2.03686222003743 2.20715635632626 4.71371612966699 1.01086320954415
## Fau 0.783889249367567 0.981981536619182 2.14123724561381 0.89489080152237
## Cisd1 1.28701037235121 1.80039905879166 3.85977617722862 1.72514540683578
## Rps3a 0.662351876056874 1.03503516597734 2.24045961891898 0.475663161589944
## Rgs10 1.54837882477117 0.518906104862294 6.07627594270224 1.68155751942136
## Tpt1 1.40997669635374 2.40366099519919 3.08097162642202 1.02548885184942
## Gm4340 1.14034610048668 1.19926542803041 1.19926542803041 1.19926542803041
gn <- markersAll$`1` %>%
subset(FDR < 0.01) %>%
rownames() %>%
.[1:10]
plotExpression(sce, features = gn, x = "snnCluster", colour_by = "snnCluster")
These look really unconvincing. The issue here is the use of the argument pval.type = "all"
argument. Given that cluster 1 appears closely connected to clusters 4 & 6 there may be no consistent marker which separates these. Cluster 2 however is highly distinct and should produce more compelling results from this strategy.
gn <- markersAll$`2` %>%
subset(FDR < 0.01) %>%
rownames() %>%
.[1:10]
plotExpression(sce, features = gn, x = "snnCluster", colour_by = "snnCluster")
This approach does need further refinement. Whilst the clusters look OK, finding key marker genes may prove difficult. Changing to pval.type = "some"
may be advantageous and allow for identification of common markers across cluster 1, 4 & 6. Re-clustering may also be viable.
plotExpression(
sce,
features = findMarkers(sce, groups = sce$snnCluster, direction = "up", pval.type = "some")$`1` %>%
rownames() %>%
.[1:10],
x = "snnCluster",
colour_by = "snnCluster"
)
highest | median |
---|---|
9.5 | 7.7 |