This is the 2nd part of co-expression network analysis using WGCNA. As we have learned in the part 1, there are 5 general steps for co-experssion network analysis using WGCNA:
We have learned how to import and preprocess data (step 1) and how to construct network and identify co-expression modules (step 2) in the last practice. In this practice, we will learn how to perform the analysis in step 3, step 4 and step 5.
As per usual, we’ll work mainly in R Markdown today, so create a new R Project in your ~/transcriptomics
folder called week_10
or something else appropriate. Call today’s R Markdown whatever you’d like, but I’ve called mine 10.2_WGCNA_part2.Rmd
. My YAML is
---
title: "BIOINF3005/7160:<br>Transcriptomics Applications"
subtitle: "Week 10.2: Co-expression network analysis using WGCNA, part 2"
date: "22^th^ May 2020"
output:
html_document:
toc: yes
toc_depth: 2
toc_float: yes
---
My setup chunk is
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.align = "center",
results = "hide",
fig.show = "hide"
)
First we need to load package WGCNA
, which is the only required package in today’s analysis
library(WGCNA)
options(stringsAsFactor = FALSE)
In case you forgot what kinds of input data we are going to use for WGCNA analysis. Here I’ll remind you the two input datasets. We only need two datasets for a typical WGCNA analysis, gene expression data and external clinical data. We used a subset of RNA-Seq data from a TCGA (The Cancer Genome Atlas) study for skin cutaneous melanomas (SKCM) as the gene experssion data. The gene expression data is a RNA-Seq raw counts file including expression profiles of human reference genes from primary tumor tissues of 41 patients with SKCM. The clinical data used in this analysis is the so called lymphocyte score, which summarises the lymphocyte distribution and density in the pathological view. For more information about the datasets, please read the article (DOI: 10.1016/j.cell.2015.05.044).
We have used gene experssion data to construct the gene co-expression network and detected co-expression modules in the last Prac. In today’s analysis, we will need the clinical data, which includes the lymphocyte score for 41 patients. The file including clinical data is available here so please right-click on that link to get the file path. Then, move to your terminal in RStudio and use wget
to download the file into your current R Project. Make sure it’s called WGCNA_PARC_clinical_data.tsv
and is in the parent directory of your R Project.
In last Prac, we saved some R objects into a file called “WGCNA_RNASeq_part1.RData”. In case that you haven’t finished the part 1 analysis or you couldn’t find your saved file, you can download the “WGCNA_RNASeq_part1.RData” from here. As usual, please right-click on that link to get the file path. Then, move to your terminal in RStudio and use wget
to download the file into your current R Project.
We will use function load()
to import those saved R objects into our current workspace. After loading the file, we can use ls()
to check whether those R objects are successfully loaded and what they are. The function should return the names of 5 saved objects, named “dynamicColors”, “dynamicMods”, “geneTree”, “MEs” and “WGCNAmatrix” respectively.
#load saved R objects from WGCNA part1
load("WGCNA_RNASeq_part1.RData") #include the file path if your file is not in current directory
ls()
The external clinical data file is a tab-delimited file, I use read.table()
to import file into R and use 1st row of the file as the column names. In this external clinical data, there is only 1 clinical trait, which is lymphocyte score, but in real world study, you may have multiple clinical traits. The analysis that we performed today should be able to applied to clinial data with multiple traits. The same as part 1, the imported data format is data.frame
format, so we remove the 1st column (LYMPHOCYTE.SCORE) and convert data.frame
to matrix
format for the clinical data and designate sample/patient ID as row names. We call this clinical traits matrix as “TRAITSmatrix” in this analysis. One important thing is that we have to make sure the sample/patient IDs of clinical traits (rownames of “TRAITSmatrix”) are in the same order as samples/patient IDs in our gene expression data (rownames of “WGCNAmatrix”), so we use “==”, which is the “equal to” operater in R to make the comparison.
traits.df = read.table("WGCNA_PRAC_clinical_data.tsv", header = T) #include the file path if your file is not in current directory
TRAITSmatrix = as.matrix(traits.df[, -1])
rownames(TRAITSmatrix) = traits.df[, 1]
colnames(TRAITSmatrix) = colnames(traits.df)[-1]
table(rownames(WGCNAmatrix) == rownames(TRAITSmatrix))
As we learned in the lecture, The module eigengene can be considered as the representive of overall gene expression profiles in a co-expression module. Next we will learn how to use heatmap and barplot to view this representation intuitively.
select.module = "green"
ME = MEs[, paste("ME", select.module, sep = "")]
par(mfrow = c(2,1), mar = c(0.3, 5.5, 3, 2))
plotMat(t(scale(WGCNAmatrix[, dynamicColors == select.module])),
nrgcols = 30, rlabels = FALSE, rcols = select.module,
main = select.module, cex.main = 2)
par(mar = c(5, 4.2, 0, 0.7))
barplot(ME, col = select.module, main = "", cex.names = 2,
ylab = "eigengene expression", xlab = "RNA-Seq sample")
In the generated plot, the top row shows heatmap of the green module genes (rows) across the 41 samples/patitents (columns). The lower row shows the corresponding module eigengene expression values (y-axis) versus the same samples/patients. Note that the module eigengene takes on low values in samples where a lot of module genes are under-expressed (green color in the heatmap). The module eigengene takes on high values for samples where a lot of module genes are over-expressed (red in the heatmap). That’s why we say module eigengenes can be considered as the most representative gene expression profile of the module.
There are two ways to quantify the associations between co-expression modules and clinical traits. As we have calculated the module eigengenes (MEs) as the most representative expression profile of the modules, the first way would be to simply correlate MEs with clinical traits.
In this example, we use Pearson correlation measurement to calculate the correlation. And we can also attach clinical trait(s) to MEs matrix, calculate the correlations, cluster the MEs with traits, and plot their associations as dendrogram and heatmap.
nSamples = nrow(WGCNAmatrix) #get the number of samples and store in "nSamples"
moduleTraitCor = cor(MEs, TRAITSmatrix, use = "p")
moduleTraitCor #view MEs-trait correlations
moduleTraitPvalue = corPvalueStudent(moduleTraitCor, nSamples)
moduleTraitPvalue #view statistial significance of MEs-trait correlations
MET = orderMEs(cbind(MEs, TRAITSmatrix))
par(mfrow = c(1,1))
plotEigengeneNetworks(MET, "", signed = TRUE, plotAdjacency = FALSE,
marDendro = c(0, 4.5, 1, 3),
marHeatmap = c(3, 5, 1, 1),
cex.lab = 0.8, xLabelsAngle = 90)
In the generated plot, the top panel (cluster tree) shows a hierarchical clustering dendrogram of the MEs in which the dissimilarity of eigengenes \(E_i\) and \(E_j\) is given by \(1−cor(E_i,E_j)\). The bottom panel heatmap shows the MEs adjacency \(A_{ij} = \frac{(1 + cor(E_i, E_j))}{2}\).
Based on the correlations, we can see that module “cyan” and module “greenyellow” have moderate correlations with our clinical trait “LYMPHOSITE.SCORE”.
Another way to quantify the associations between co-expression modules and clinical traits will be using the gene significance (GS) measure. In this analysis, the gene significance measure \(GS_i\) of gene \(i\) is defined by \(GS_i = |cor(x_i, T)|^\beta\) as we learned in lecture. We will use “4” as power \(\beta\). The GS is also ane way to measure the potential biological significance between individual genes and traits.
TRAITlscore = TRAITSmatrix[, 1]
GS = abs(cor(WGCNAmatrix, TRAITlscore, use = "p"))^4
head(GS)
GS.sorted = as.matrix(GS[order(GS[,1], decreasing = T),])
head(GS.sorted)
We can also define a measure of module significance \(moduleSignif\) as the average gene significance of all genes in the module. As we learned in the lecture, the equation to calculate module significance is \(moduleSignif = \frac{\sum_{i}GS_i}{n}\). We use the absolute value for defining a correlation based gene significance measure. We use one R loop function named tapply()
to calculate the average gene significance of all genes in each module. Basically, tapply()
applies a function or operation on subset of the vector broken down by a given factor variable (“dynamicColors” in our example). For more information about function tapply()
, type ?tapply
to get the full manual. We can use function plotModuleSignificance()
in WGCNA
to plot the module significance. The plot is actually a barplot showing the average gene significance (\(GS\)) with variance.
moduleSignif = tapply(GS, dynamicColors, mean, na.rm = T)
moduleSignif
par(mfrow = c(1,1), mar = c(5, 5, 4, 2))
plotModuleSignificance(GS, dynamicColors)
The \(moduleSignif\) also shows that module “cyan” (\(moduleSignif = 0.0092680339\)) and module “greenyellow” (\(moduleSignif = 0.0096785844\)) have the highest values, which are consistent with what we observed from the 1st way of quantifying correlation between MEs and traits.
So far, we have identified potentially biological interesting co-expression modules. The next question would be how we can identify interesting/important genes in one co-expression module? As we learned in the lecture, we will use an eigengene-based intramodular connectivity measure (also called “module membership measure”), which is simply defined by the correlation between the \(ith\) gene expression profile (\(x_i\)) and the module eigengenes: \(kME_i = moduleMembership(i) = cor(x_i, ME)\).
kME = cor(WGCNAmatrix, MEs, use = "p")
kMEpvalue = corPvalueStudent(kME, nSamples)
dim(kME)
kME[1:5, 1:5]
Then we can identify genes that have a high significance with clinical data as well as high module membership in interesting modules, e.g. module “greenyellow” in our analysis. We use a function verboseScatterplot
in WGCNA
to generate a scatter plot showing the correlation between gene significance (\(GS\)) and module membership for module “greenyellow”.
select.module = "greenyellow"
moduleID = "MEgreenyellow"
moduleGenes = (dynamicColors == select.module)
table(moduleGenes)
kME.module = kME[moduleGenes, moduleID]
GS.module = GS[moduleGenes, 1]
verboseScatterplot(abs(kME.module),
abs(GS.module),
xlab = paste("Module Membership in", select.module, "module"),
ylab = "Gene significance for lymphocyte score",
main = paste("Module memebership versus gene sinificance\n"),
col = select.module,
cex.main = 1.2, cex.lab = 1.2, cex.axis = 1.2)
abline(v = 0.6, h = 0.03, lty = 2, col = "red")
GOI = names(kME.module)[abs(kME.module) > 0.6 & abs(GS.module) > 0.03]
After we get a list of genes potentially biologically meaningful, we can perform additional analysis, such gene set enrichment analysis (GSEA) as you have learned from last week’s Prac to validate the biological significance of selected co-expression modules.
So far, we have finished all core components of gene co-expression network analysis. We have constructed co-expression modules, identified interesting modules and further identified central players in interestring modules. But wait for a moment, you may say I haven’t seen one single network in the whole analysis. The fact is that network is the intuitive way of showing the fundamentals of co-expression analysis, the relationship between genes with respect to their expression in multiple samples. We can export the relationship between genes (e.g. TOM) in co-expression modules as the connection of nodes and visualise it in network platform, e.g. cytoscape (https://cytoscape.org/).
TOM = TOMsimilarityFromExpr(WGCNAmatrix, corType = "pearson", networkType = "signed", power = 4)
dim(TOM)
TOM[1:5, 1:5]
nodeNames = colnames(WGCNAmatrix)[moduleGenes]
TOM.module = TOM[moduleGenes, moduleGenes]
cyt = exportNetworkToCytoscape(TOM.module,
edgeFile = paste("CytoscapeInput_edges_",
paste(select.module, collapse = "_"),
".txt", sep = ""),
nodeFile = paste("CytoscapeInput_nodes_",
paste(select.module, collapse = "_"),
".txt", sep = ""),
weighted = TRUE,
threshold = 0.02,
nodeNames = nodeNames,
nodeAttr = dynamicColors[moduleGenes])
For more information about gene co-expression network analysis, please have a look at the homepage of WGCNA (https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/).
Remember the key points of gene co-expression network anlaysis are:
Number one, biological significance! Number two, Biological significance! Number three, Biological significance!