For today’s session, we’ll continue to look at the microarray dataset from last week and will perform differential expression analysis on this set of genes. The main focus today is to understand how to apply a statistical model to a dataset.
All of the strategies we’ll cover today are from the R package limma
, which has been developed and supported by Prof Gordon Smyth for nearly 20 years. It is one of the most heavily used packages in Bioconductor, and the vignette is one of the most authoritative texts on general transcriptomic analysis, and the underlying models. As well as being a very high level R developer, Prof Smyth also happens to be one of the world’s great living statisticians, and one who is highly active in the transcriptomic research space.
In order to set ourselves up for the day, please ensure you are in your ~/transcriptomics
folder, and using an R Project called practical_6
within that folder. If you can’t find your way, then please ask for help.
Begin an R Markdown and tidy up the YAML header. Here’s an example YAML header (this is actually the YAML from this instruction page). Add your own name in the (currently missing) author:
field if you’d like, and feel free to change or delete the title/subtitle.
---
title: "BIOINF3005/7160:<br>Transcriptomics Applications"
subtitle: "Week 6.1: Linear Models in Differential Expression"
date: "8^th^ April 2020"
output:
html_document:
toc: yes
toc_float: yes
---
After the header, add the usual setup
chunk. Notice that I’ve added an extra line this time to ensure all figures are centre-aligned.
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE,
fig.align = "center"
)
The libraries we’ll need for today are as follows, and we’ll discuss all of these as we come across them during the session.
library(Biobase)
library(limma)
library(magrittr)
library(tidyverse)
library(ggrepel)
library(broom)
library(pander)
library(DT)
theme_set(theme_bw())
The dataset we’ll analyse is the same as last week as we’re really just working on a continuation of this analysis. A version of this pre-prepared as an ExpressionSet
is available for direct import into your VM.
eset <- url("https://github.com/UofABioinformaticsHub/transcriptomics_applications/raw/master/practicals/data/pi16_Th_eset.rds") %>%
read_rds
In our last session we used a \(T\)-test to compare between two groups, and this is effectively what we do for Differential Gene Expression Analysis. As a quick example, we just grabbed our first gene and passed it to t.test()
.
x <- exprs(eset)[1,]
df <- pData(eset) %>% mutate(x = x)
t.test(x~PI16, data = df)
However, in the above code, we are restricted to the situation where we have only two groups. In reality, we often have more than two and we need to fit this using a more complete model formula.
The most common approach to statistical analysis is to form a design matrix which contains a column for every parameter we are trying to fit, and with a row for every sample. Fortunately, our experiment is simple, in that we have two groups so our design matrix is pretty simple. This gives us a good chance to look at it though.
design <- model.matrix(~PI16, data = pData(eset))
design
This parameterisation can appear counter-intuitive to many non-statisticians, so let’s explore this a little. Our first column allows us to estimate our baseline expression level, from our reference sample group. Commonly in a linear model, the baseline is considered to be when our main predictor variable (e.g. PI16 status) has the value \(0\), and we can then fit the ‘slope’ of the line as the next column. Hence the name (Intercept)
. All we need to consider that the Intercept
is going to provide the estimate of expression in our reference sample group.
When we defined PI16
as a factor
in our initial data.frame
, we actually defined our reference group. The first factor in a categorical variable is always treated as the intercept or baseline term in this context, with everything else estimated relative to this reference group.
The important thing is that the second column allows us to estimate the difference in expression due to our second condition. Here this is when we have PI16+ cells, and it is this column which is going to provide our direct estimate of logFC.
Many people wonder why we don’t fit an expression estimate for each cell type separately, and then calculate the difference between them. This can also be done and would use the following syntax:
model.matrix(~0 + PI16, data= pData(eset))
However, this would provide estimates of expression for each cell type, and then we would have to define a second contrast matrix to compare between the two estimates. To a statistician this seems like extra work, when the same thing can be performed in one step using the first approach. However, there are many occasions where this is indeed the best approach, e.g. when we have multiple groups and pair-wise comparisons are the analysis of interest.
Let’s see how this model works using a simple linear fit on the first gene.
gene1 <- lm(x ~ PI16, data = df)
tidy(gene1)
Here we can see that the estimate of expression in our baseline cell type (PI16-) is 6.52, whilst the difference in expression for PI16= cells is -0.26. In other words, the estimated logFC due to being a PI16+ cell is -0.26. The non-significant \(p\)-value (> 0.05) indicates that we would accept \(H_0\) and consider that the true average logFC that we are estimating is indeed zero.
Let’s compare that to our t.test()
.
t.test(x ~ PI16, data = df)
Notice that our sign is reversed, but our estimates and our \(T\)=statistics are the same. Any slight difference in the \(p\)_values is likely due to alternative approaches to calculating the degrees of freedom.
One of the advantages of our type of approach here is that we are fitting the exact same statistical model to every gene. Once we have our design matrix, we pass this to a single function which will fit every gene. In limma
that function is called lmFit()
.
fit <- lmFit(eset, design)
Notice how in one simple line, we fitted all 29136 genes. And it took about one second!
If you enter the object name into the Console, you’ll get a bit of an information dump, but there are some important and recognisable things here.
coefficients
. The first column of coefficients is the estimated expression in our baseline (PI16-) cell type, whilst our second column estimates the difference for each gene within the PI16+ samples. This is our logFC estimate for PI16+ cells in comparison to PI16- cells!sigma
and this is our gene-specific estimate of the population standard deviation (also known as s
in lectures).genes
, and because we have setup our eset
object correctly, this are now included in the output here for convenience.genes
element is one called Amean
and this is the average expression level we can use to make MA plots.As mentioned in the statistics lecture, as part of the fitting process we estimate the population variance for each gene. In transcriptomics this is explicitly assumed to be the same across sample groups or cell types, and is provided in the fit
object as the element sigma
. A near compulsory step in transcriptomics (for microarray data) is to incorporate an additional step in which an Empirical Bayes model is applied to moderate these variance estimates. This reduces false positives which are due to unrealistically low estimates of variance, and provides additional power to detect genes with a large logFC, but which were ranked too low due to an excessively large variance estimate. Mathematically, this is well beyond many bioinformaticians, so let’s just leave it at the conceptual level for now. Interestingly, this process also calculates a new value for degrees of freedom for the corresponding \(t\)-test which is larger than before, as we have borrowed information from all other genes in this calculation.
To include this important step, we simply call the function eBayes()
after we’ve called fit()
. This can be done in one process using the magrittr
.
fit <- lmFit(eset, design) %>%
eBayes()
This adds an element to the fit
object called s2.post
as it is the posterior estimate (or moderated estimate) of the variance, \(\sigma^2\). If we compare these moderated variances to the original values, you’ll see the effect described above.
(Each gene’s degrees of freedom is now found by looking fit$df.residual + fit$df.prior
)
Now we’ve fitted every gene, we can get the top 10 using the simple command topTable()
. However, we’ll need to specify the model coefficient, which is simply the name of the relevant column of the design matrix.
topTable(fit, coef = "PI16+")
More practically, we can obtain the complete ranked list, and while we’re there, we can turn it into a tibble
and keep just the information we need.
results <- topTable(fit, coef = "PI16+", number = Inf) %>%
as_tibble() %>%
dplyr::select(
starts_with("gene"), logFC, AveExpr, t, P.Value, adj.P.Val
)
results
affy_id
column is worth keeping, feel freeadj.P.Val
column contains an FDR-adjusted p-valueWe can easily see how many DE genes we have to an FDR of any given value (e.g. \(\alpha = 0.05\)).
sum(results$adj.P.Val < 0.05)
We’ve already seen some of these plots as we learned ggplot2
, but now we’ll see some key plots in context.
The first thing we need to check is whether there is any bias in estimates of logFC, and the common plot for checking this is the MA
plot. (Remember the minus/add plot). Here we’ll plot our Average Expression across all samples (A) against the the average difference in expression between samples (M = logFC). Whilst we first saw these comparing two samples in two-colour microarrays, here we’re looking at the final results.
The easy version is made using plotMA
plotMA(fit)
However, this can be hard to customise, so let’s go straight back to ggplot2
. This version is a generic version which slightly improves the one from plotMA()
.
results %>%
ggplot(aes(AveExpr, logFC)) +
geom_point() +
geom_smooth(se = FALSE)
Can you see any bias? If so, what might have caused this, and would this dataset trouble you?
Now we’ve had a look at a generic plot, my standard approach might be something along these lines
results %>%
mutate(DE = adj.P.Val < 0.05) %>%
ggplot(aes(AveExpr, logFC)) +
geom_point(aes(colour = DE)) +
geom_smooth(se = FALSE) +
geom_text_repel(
aes(label = gene_name, colour = DE),
data = . %>%
dplyr::filter(DE & abs(logFC) > 2),
show.legend = FALSE
) +
scale_colour_manual(values = c("black", "red"))
Another important check is to look at your p-values. We know that under \(H_0\) p-values will have a Uniform Distribution between 0 & 1 (i.e. \(p \sim \mathcal{U}(0, 1)\)). Under \(H_A\) we generally assume that p-values will be heavily biased towards 0 (i.e. we are unlikely to see data like if \(H_0\) was true). In our data we have a combination of genes which are DE (i.e. \(H_A\) is true) and not DE (i.e. \(H_0\) is true), so we should have a distribution of p-values that looks like these two things mixed together.
results %>%
ggplot(aes(P.Value, stat(density))) +
geom_histogram(fill = "grey", colour = "black", bins = 100)
Here you can see all of the p-values near zero which contain those from genes we consider as DE (i.e. we reject \(H_0\)), whilst the rest of the p-values look like they are randomly distributed between 0 & 1. This dataset looks good, but sometimes you do see funny shapes and these can be worth paying attention to.
A common strategy is to inspect a volcano plat both before we present our final results, and as a method of presenting our final results. For a volcano plot, we put our estimated logFC on the x-axis, and a measure of significance on the y-axis. A common trick is to use our raw p-values on the -log10 scale, which places low p-values very high (\(-\log_{10}(0.001) = -\log_{10} 10^{-3} = 3\)). (NB: Don’t plot FDR-adjusted p-values unless you have a good reason to.)
Here’s an initial generic plot we can use to make decision about our results. A good idea can be to start by colouring our DE genes just to see where they are.
results %>%
mutate(DE = adj.P.Val < 0.05) %>%
ggplot(aes(logFC, -log10(P.Value))) +
geom_point(aes(colour = DE)) +
scale_colour_manual(values = c("black", "red"))
Some things to consider when inspecting these results are:
A common strategy might be to consider a gene as DE using our p-values as an initial filter, but to then apply a cutoff value to logFC. This is based on the idea that ‘biologically meaningful’ changes in expression are individual genes changing noticeably, and these key genes drive, or represent, some key biology. Whilst very common and widely used, there are two important things to consider:
There is no right answer, but these are important points to keep in mind
Let’s try applying a filter, and a very common filter is for our estimated logFC to be beyond the range \(\pm 1\). This represents a 2-fold change in expression of the mRNA under investigation, whilst in reality, any value could be chosen that seems justifiable.
results %>%
mutate(
DE = adj.P.Val < 0.05 & abs(logFC) > 1
) %>%
ggplot(aes(logFC, -log10(P.Value))) +
geom_point(aes(colour = DE)) +
geom_vline(xintercept = c(-1, 1), colour = "blue", linetype = 2) +
scale_colour_manual(values = c("black", "red"))
Does this use of a hard cutoff seem appropriate?
Can you see points which are highly ranked, but just miss out?
This is a problem that everyone is aware of, but just does their best to work with. If you’re happy with those results, we could make a final plot for our collaborators. We’re being a bit fancy here, by labelling everything above the value 6 on the y-axis (\(-\log_{10} 10^{-6} = 6\)), but we’ve added a second option for DE genes with a reasonably large fold-change. Effectively we’re labelling points around the outer edge of the plot. This is always an aesthetic call, but biologists love to see their favourite genes in these plots.
results %>%
mutate(
DE = adj.P.Val < 0.05 & abs(logFC) > 1
) %>%
ggplot(aes(logFC, -log10(P.Value))) +
geom_point(aes(colour = DE)) +
geom_text_repel(
aes(label = gene_name, colour = DE),
data = . %>%
dplyr::filter(P.Value < 1e-6 | (DE & abs(logFC) > 2)),
show.legend = FALSE
) +
geom_vline(xintercept = c(-1, 1), colour = "blue", linetype = 2) +
scale_colour_manual(values = c("black", "red"))
Some further plots we might like to make are boxplots of our favourite genes, or heatmaps of a larger groups of genes. We’ve covered these already, so will move straight on.
Another strategy available under these approaches is to down-weight lower quality samples and fit a weighted regression model. This is very dependent on the fact that we are working with Normally Distributed data. Sometimes this can be a way of including samples which aren’t terrible, but just make you uneasy (that’s a formal statistical term). It’s a pretty simple thing to calculate, and then we can add it to our sample-level metadata.
w <- arrayWeights(eset)
pData(eset)$weights <- w
Once we’ve calculated them, we should have look to see if we have any problematic samples or donors. Samples which are consistently near the mean (i.e. with small residuals) will be given high weights, whilst those which are consistently more distant from the mean will be given low weights. In a perfect experiment, each sample would have a weight of 1, so this can be a good reference value to plot.
pData(eset) %>%
ggplot(aes(PI16, w, fill = PI16)) +
geom_col() +
geom_hline(yintercept = 1, linetype = 2) +
facet_wrap(~donor, nrow = 1)
We can rerun all of our previous code (using eBayes()
or treat()
) with this simple addition to obtain slightly improved results.
weightedResults <- eset %>%
lmFit(design = design, weights = pData(.)$weights) %>%
eBayes() %>%
topTable(coef = "PI16+", number = Inf) %>%
as_tibble() %>%
dplyr::select(
starts_with("gene"), logFC, AveExpr, t, P.Value, adj.P.Val
)
Comparing this to our standard analysis (i.e. without weights) we should only see a slight change
results %>%
dplyr::filter(adj.P.Val < 0.05 & abs(logFC) > 1) %>%
nrow()
weightedResults %>%
dplyr::filter(adj.P.Val < 0.05 & abs(logFC) > 1) %>%
nrow()
If we have time, compare the estimates of logFC and the p-values (on the -log10 scale), but this may take a good 10-15 minutes of figuring out the best dplyr
and ggplot()
strategies.
In our results above, we’ve applied the classic null hypothesis and with the classic alternate hypothesis, i.e. \(H_0: \mu = 0\) against \(H_A: \mu \neq 0\). Here \(\mu\) represents the true average log fold-change.
Once we obtained our p-values and estimated our false-discovery rate, we filtered our list of potentially DE genes based on some intelligently chosen value of estimated logFC.
An alternative strategy favoured by some is to define a slightly different null hypothesis. Consider this: \[ H_0: -\lambda \leq \mu \leq \lambda \\ \text{Vs} \\ H_A: |\mu| > \lambda \]
Here we’re saying that our true mean logFC (\(\mu\)) is within a range around zero [\(-\lambda, \lambda\)] under \(H_0\), or it’s outside of this range \(|\mu| > \lambda\). All we need to do is decide on a suitable value for \(\lambda\).
This strategy is implemented in the function treat()
in the limma
package. And we would use this instead of eBayes()
. Whilst this function still calculates moderated \(t\)-statistics, the underlying hypothesis has changed and as such the moderated \(t\)-statistics will be different. The default value utilised by this function is \(\lambda = \log_2 1.2 \approx 0.263\), which provides a small range around zero for the null hypothesis. Once we have applied this, we no longer need to filter our results based on logFC as we have effectively removed results where we may have near-zero logFC but that have made significance for reasons more connected to low variability.
treatFit <- lmFit(eset, design, weights = pData(eset)$weights) %>%
treat()
treatResults <- topTreat(treatFit, n = Inf, coef = "PI16+") %>%
as_tibble() %>%
dplyr::select(
starts_with("gene"), logFC, AveExpr, t, P.Value, adj.P.Val
)
Under this approach, we wouldn’t expect any changes to the MA-plot, but we do expect changes to our other two plots.
treatResults %>%
ggplot(aes(P.Value, stat(density))) +
geom_histogram(fill = "grey", colour = "black", bins = 100)
Notice that our p-value distribution now looks very different. This fundamental change to the null hypothesis has noticeably impacted this plot, but this does make sense despite the previous claim that under \(H_0\) we expect \(p \sim \mathcal{U} (0, 1)\). Anything with an estimated logFC directly within that range will automatically be given a p-value \(\approx1\), and if we remember our initial volcano plot, there are a large number of these. Additionally, all the genes where the logFC estimate is just outside the range \(\pm \lambda\), will also receive a p-value near 1, and once again there are lot of these. Fortunately, we still see our tell-tale spike near zero, so we still have our DE genes.
It should be noted that this type of null hypothesis is not common in other statistical fields, but makes a whole lot of sense in our context.
Let’s compare our results:
results %>%
dplyr::filter(adj.P.Val < 0.05 & abs(logFC) > 1) %>%
nrow()
treatResults %>%
dplyr::filter(adj.P.Val < 0.05) %>%
nrow()
It appears we have fewer significant DE genes, so this is clearly a more conservative strategy, but it can be very useful when you have large lists of DE genes and you can’t decide where draw a hard cutoff. When we make our volcano plot, you may notice a few differences:
treatResults %>%
mutate(
DE = adj.P.Val < 0.05
) %>%
ggplot(aes(logFC, -log10(P.Value))) +
geom_point(aes(colour = DE)) +
geom_text_repel(
aes(label = gene_name, colour = DE),
data = . %>%
dplyr::filter(P.Value < 1e-6 | (DE & abs(logFC) > 1.8)),
show.legend = FALSE
) +
geom_vline(xintercept = c(-1, 1), colour = "blue", linetype = 2) +
scale_colour_manual(values = c("black", "red"))
Now we’ve looked at a few analytic approaches, the next step after obtaining a list of results, or DE genes would be to share these with our collaborators. An extremely simple strategy is to export a csv
or similar type of file. These can easily be emailed around and inspected by those addicted to the Microsoft Excel flavour of crack.
Our good old readr
functions can come into play for this. Notice that we’re going to use write_csv()
as opposed to write.csv()
. Once again, they’re very similar, but there are just a few edge cases that write_csv()
handles a little better, so this should be the function of choice.
write_csv(treatResults, "PI16Th_weightedTreat.csv")
In reality, we may also be sharing an html document we’ve generated using R markdown (example here: https://uofabioinformaticshub.github.io/20170327_Psen2S4Ter_RNASeq/).
If we’re taking the html/R Markdown approach, we might need to think about which genes we would show in a table. A simple approach might be to just grab the top 20 or so, and use pander()
to make a nice markdown table. This is very much an area we’d need to return to the tidyverse
for as well, and could use dplyr::slice()
or dplyr::filter()
depending on which set of results we’re wanting to show.
treatResults %>%
dplyr::slice(1:20) %>%
pander(
justify = "llrrrrr",
digits = 3,
caption = "Top 20 DE genes from the comparison of *PI16^+^* Th cells with *PI16^-^* Th cells"
)
An alternative might be to use the package DT
, which is able to produce searchable and sortable html tables by wrapping a whole lot of javascript, so that we don’t have to see it. Once again, this function plays very nicely with the magrittr
. Given that we will generate a sortable, searchable table, with multiple pages, we no longer need to filter()
or slice()
our results as aggressively, but might just like to include our complete set of DE genes.
treatResults %>%
mutate(DE = adj.P.Val < 0.05 & abs(logFC) > 1) %>%
dplyr::filter(DE) %>%
dplyr::select(-DE) %>%
datatable(
caption = "List of all DE genes detected in the comparison between *PI16^+^* Th cells and *PI16^-^* Th cells",
rownames = FALSE
)
This is a very handy approach as collaborators can explore the list deeply, and can search for their favourite genes. However, the obsessive amongst us might like to make this look a little neater, by formatting the numeric columns.
logFC
, AveExpr
and t
columns to only 2 decimal points using formatRound()
.P.Value
and adj.P.Val
columns to 3 significant digits using formatSignif()
.treatResults %>%
mutate(DE = adj.P.Val < 0.05) %>%
dplyr::filter(DE) %>%
dplyr::select(-DE) %>%
datatable(
caption = "List of all DE genes detected in the comparison between PI16+ Th cells and PI16- Th cells",
rownames = FALSE
) %>%
formatRound(
columns = c("logFC", "AveExpr", "t"),
digits = 2
) %>%
formatSignif(
columns = c("P.Value", "adj.P.Val"),
digits = 3
)
There are lots more tweaks for this style of output, which we won’t have time for here.
In this dataset from the last two sessions, we only had two conditions: PI16+ Vs PI16- Th cells. Both of these sample groups were Th cells, which were obtained by sorting blood samples for a specific combination of surface proteins. An additional cell type used in the complete dataset was Treg cells obtained using a different shared surface marker (CD25+), which were again split into PI16+ and PI16- cells. This would give us 4 cell types, and this presents quite a few challenges and possibilities for how we might approach our analysis. We’ll have a look at this in the next session.