Spring Into Bioinformatics

Session Themes and Outcomes
- Differential expression
Data Setup
Data QC
Differential Expression
- Voom transformation and calculation of variance weights
- Tasks

Session Themes and Outcomes

In this session we’re going to learn a few things about performing a differential expression analysis on gene expression data in R.

At the end of this session, you should be able to do the following tasks:

Read a counts table into R including:

set the required delimiter
selectively remove unwanted columns
tidy up the column names

Explore basic data parameters such as library size
Create a DGEList object
Assess sample similarity and identify outliers using Unsupervised Clustering
Perform a simple differential expression analysis

Differential expression

Now that we have our data in a nicely formatted file, we can move to R and follow a fairly standard workflow. Along the way, we’ll come across a few useful packages, data structures and coding tricks which will be applicable in many other contexts.

The packages we’ll use for this include not only the tidyverse, but a few other packages which are hosted on Bioconductor, such as limma, edgeR and AnnotationHub. We’ll also add the magrittr,pander and scales packages as they contain some useful additional utilities.

library(limma)
library(Glimma)
library(edgeR)
library(AnnotationHub)
library(tidyverse)
library(magrittr)
library(scales)
library(pander)
library(ggrepel)
library(RColorBrewer)

Data Setup

Import

First we should import the file we created using featureCounts. Please take a moment to check the location of the counts.out file you created in the last session to ensure the assignment of the location to the file object is correct.

file <- "./data/2_alignedData/featureCounts/counts.out"
rawCounts <- read_delim(file, comment = "#", delim = "\t")

This file has the gene identifiers in the first column, followed by the chromosome number, start, end, strand and length information (recall that we have subset our data to only include mouse chromosome 1), with the remaining columns containing the name of the bam file and the number of reads aligning to each gene. The first thing we might like to do is to remove the columns which don’t contain the count data we require today. After this we can tidy up those column names to remove some uninformative parts of the column names.

The columns we need are Geneid and from column 7 onwards. All of the counts columns end with bam as they have been generated from a bam file, so we can use this using dplyr::select() that we saw yesterday.

Once we’ve done this, we’ll use the function basename() to remove all of the file path information, except for the actual file name itself. Finally, we’ll introduce the function str_remove() from the stringr package, which is art of the core tidyverse. Considering that all of our bam files finish with ".trimmedAligned.sortedByCoord.out.bam", which is not particularly informative, we’ll remove this from the end of the column names.

countData <- rawCounts %>%
  dplyr::select(Geneid, ends_with("bam")) %>%
  set_colnames(basename(colnames(.))) %>%
  set_colnames(str_remove(colnames(.), ".trimmedAligned.sortedByCoord.out.bam"))

If you’d like to see some of these steps in action a bit more closely, try the following commands in the R Console (only keep them in your R Markdown notes if you think they’re helpful)

colnames(rawCounts)
colnames(rawCounts) %>% basename()
colnames(rawCounts) %>% 
  basename() %>%
  str_remove(".trimmedAligned.sortedByCoord.out.bam")

The package stringr contains many other useful functions such as str_to_upper(), str_to_lower(), str_to_title(). We’ll see a few more in action soon too.

Now our countData object looks much cleaner and we haven’t lost any important information, but we’re not quite done. The group information we need for the differential expression is locked in the sample name. Next we can split the sample ID from the group information and save that information in a tibble with our metadata for later use.

sampleGroups <- tibble(
  samplename = str_subset(colnames(countData), "SRR")
) %>%
  tidyr::separate(col = samplename, into = c("samplename", "group"), remove = TRUE)

Now we have a small table of metadata regarding our samples. We can also tidy up our countData table column names a little more by moving the gene IDs to row names and removing the group information after the sample name.

We really need to do this for the steps that are coming. The fundamental data structure we need to perform Differential Gene Expression analysis, needs gene identifiers as rownames, meaning a tibble is not really viable for this step.

countData <- countData %>%
  as.data.frame() %>%
  column_to_rownames("Geneid") %>%
  set_colnames(
    str_extract(colnames(.), "SRR[0-9]+")
  )

Regular Expressions

In this final line, we’ve seen a very important thing called a regular expression, inside the function str_extract(). This function looks inside a text string and grabs out the section that matches our string. An example might be:

parents <- c("Mother", "Father") 
str_extract(parents, "ther")

We can also modify text.

str_replace(parents, "ther", "m")

And capture text that matches patterns

str_replace(parents, "(.+)(ther)", "\\1m")

In this last line, we’ve introduced:

the single wild-card character for regular expressions: .
the symbol for match any number of the previous patterns: +
the way we “capture” text by using round braces.

In this example we’ve capture two pieces of text:

Anything (.), any number of times (+)
The pattern ther

If we’d entered "\\2m" we would’ve returned the string therm from both elements of parents.

Instead of using wild-cards, we can also restrict our wild-cards to specific values, such as [0-9] which would be any number, or [0-9]+ which is any number, any number of times. This is what we’ve done when we dropped the .cbc and .skm extensions from our sample names. We matched “SRR[0-9]+” which will capture the sample IDs, but as soon as it hits the period (.) before the cbc/skm text fragment, the match would stop, and we have used str_extract() to return only the text fragment that matches.

After our brief excursion, we can move on to setting or data up for DGE analysis.

Create a DGE List

The main object type we like to use for differential gene expression is a DGEList, which stands for Digital Gene Expression List, and is an object used by edgeR to store count data. This object is more complex than others we have looked at in R so far because it contains a number of slots for storing not only our data, but also different parameters about the data. The DGEList objects have two mandatory elements, with the first being our counts data, and the second being our sample information. Recall that above we have made the gene IDs the row names and the sample IDs the column names.

Here’s one way to create a DGEList.

dgeList <- countData %>%
    DGEList() %>%
    calcNormFactors()

Here we have taken our counts table with gene IDs as row names and sample IDs as column names and put it in to a DGEList object, called dgeList, and called the calcNormFactors() function to calculated the normalisation factors, which can be seen in dgeList$samples.

If we decided to fit our counts using a negative binomial model (for discrete data), these norm.factors would be included in the model to adjust for variations in the library size and count distributions (for an interesting post on what “fitting” a model is see fit).

We have chosen to use a different approach for our analysis today, but taking the time to include the calculation of normalisation factors step as we form the DGEList object is still good practice as it allows us to change methodology part way through our planned analysis if we determine the negative binomial model is better suited to our data.

Take a moment now to run this next code chunk and inspect the samples data frame sitting inside the DGEList object we have just created.

dgeList$samples

R understood that the column names from our counts table contained the sample IDs for our project and created the samples data frame inside the DGEList object. lib.size and norm.factors have also been calculated from the counts table and incorporated into the element. We haven’t yet included any information describing our samples. In the samples element, we can set the group variable we previously took from the sample IDs by running:

dgeList$samples$group <- sampleGroups$group %>%
    factor(levels = c("cbc", "skm"))

Take another moment now to inspect the samples data frame again and you will now see what was previously a row of 1s is now populated with the appropriate tissue type.

dgeList$samples

As a small note of warning, keep in mind that we did not at any time alter the order in which our samples were being reported. This allowed us to simply cut the information from one table and paste it into another. There are ways of using mutating joins to ensure order is never corrupted including left_join(), right_join(), inner_join() and others.

It is always worth taking the time to perform a “sanity check” to ensure the sample names and groups are correctly aligned before you continue.

Add Gene Information

A common, but optional element that we can include in a DGEList, is one called genes. This is where we can place information such as the genomic location, the gene symbol and anything else we think is going to be relevant. First, we need to find this information though, and we’ll use the package AnnotationHub which contains the metadata for all of the annotation packages included in Bioconductor.

ah <- AnnotationHub()
ah

We saw these yesterday, so hopefully they’re not as scary today. Note that we have numerous data providers, species and data classes. The column on the left (AH####) is a shorthand ID we can use to retrieve any of these annotation objects.

We can also subset the Annotation Hub object to help us find what we’re looking for. In the following code, we’re restricting our search to mouse, and specifying Ensembl as our data source.

The final line looks for an object of class EnsDb which is a database object containing a large amount of information, but in a relatively user-friendly way.

ah <- ah %>%
  subset(species == "Mus musculus") %>%
  subset(rdataclass == "EnsDb")

We have now subset ah and removed most of the information we don’t need (i.e. information related to other species and other data providers). When we inspected ah above we saw that the latest release was AH69210 | Ensembl 96 EnsDb for Mus Musculus. Today we will use this annotation record and define it as the object ensDb

ensDb <- ah[["AH69210"]]
ensDb

There are many helper functions for extracting data from this package, such as transcripts(), promoters() and genes(). We want gene information today, so let’s just use that function.

genesGR <- genes(ensDb)

This object is a GenomicRanges (or GRanges) object and these are the bread and butter of genomic analysis in R. We could spend hours just looking at these, but the main point is that on the right of the gene IDs we have the chromosome (seqnames) followed by the range of bases the genes is contained within and the strand the gene is located on (note that start and end positions are both 1-based relative to the 5’ end of the plus strand of the chromosome, even when the range is on the minus strand). This is the core GRanges element. Run the following code chunk to return this information using the function granges()

granges(genesGR)

Underlying each GRanges object is a seqinfo object and these contain all of the genomic information about chromosome names and lengths. If comparing two GRanges objects, they must have identical seqinfo objects otherwise the comparison will return an error. This actually makes perfect sense, but can create issues when comparing data from UCSC, NCBI and Ensembl as they all use different formats for their chromosome names, even though they’re based on the same assemblies.

seqinfo(genesGR)

To the right of the pipes, we have the metadata columns, accessed using the function mcols(). Notice this returns a DataFrame which is a slightly more controlled version of a data.frame. The differences are beyond the scope of this course, but they can easily be coerced to a data.frame using as.data.frame. If you want to use the dplyr functions on them, you will need to go through this step.

mcols(genesGR)

Before we place this information into the genes element of our DGEList object we will remove columns that are redundant for our purposes today, and keep only the four most useful ones. We can achieve this by calling the mcols() function on our GRanges object. Firstly we use the square brackets to subset the genesGR object we created earlier, and then by using the assignment operator (<-) to overwrite this subsetted element back into the object.

mcols(genesGR) <- mcols(genesGR)[c("gene_id", "gene_name", "gene_biotype", "entrezid")]

If we wish to add this to our DGEList, note that the order of genes will be completely different. To fix this, we will use the rownames of our DGEList object to reorder the genes object. Taking the time to ensure that the two gene lists are in the same order is important because R will give no warning or error to inform us if the two gene lists are in different orders. R will not check that these two objects are compatible.

dgeList$genes <- genesGR[rownames(dgeList)]

Here we have used square brackets [] to subset and reorder the genesGR object such that it now only contains the genes in our dgeList object, and in the same order.

Now when we subset our DGEList by gene, the genes element will also be subset accordingly and the initial relationships will be maintained. Take a moment to inspect the first four elements of our dgeList object by running the code chunk below.

dgeList[1:4,]

You will notice that the dgeList$genes element is now populated with the information from annotationHub we downloaded earlier, and that dgeList$counts and dgeList$genes are in the same order.

Data QC

Undetectable genes

You may have noticed that no reads aligned to a number of the genes contained in this dataset. As common practice we remove genes with zero counts across all samples before we continue with our initial data exploration. Rather than inspect our data one line at a time looking for zeros we can use R to perform a logical test to see how many genes were not detected in our dataset. First we’ll add up the total counts for every gene and see how many received at least one count. In the following code chunk we will take the counts element from our DGEList object and pipe it into the function rowSums() which will output a named number vector with a single value (the sum of each gene) for each gene. This vector is in turn piped into another function is_greater_than() which will ask the logical question “is the value greater than the given value”, with the given value being 0. We then pipe the TRUE/FALSE answers to this logical question into table which will provide us with a simple two number summary.

dgeList$counts %>% 
    rowSums() %>%
    is_greater_than(0) %>%
    table

Any genes that fail our test (i.e. the sum of which are not greater than zero) will be counted in the FALSE column, and any genes that pass our test (i.e. the sum of which is greater than zero) will be counted in the TRUE column. We can see that a proportion of genes were not expressed or were undetectable in our original samples. A common approach would be to remove undetectable genes using some metric, such as Counts per Million reads, also known as CPM. We could consider a gene detectable if returning more than 1CPM, in every sample, from one (usually the smaller), of the treatment groups.

Although our dataset is small (6/8 libraries are < 1e6 reads), we usually deal with libraries between 20-30million reads. In the context of the more usual library size a CPM of 1 equates to 20-30 reads aligning to a gene, which we test in every sample from a treatment group. Here our smallest group is 4 so let’s see what would happen if we applied the filter of > 1 CPM in 4 samples.

First we’ll calculate the CPM for each gene in each sample, and then we’ll apply a logical test to each value checking whether it’s greater than 1. Next, we’ll add these up for each gene (i.e. row) and this will give us the total number of samples for each gene, that passed our criteria of CPM > 1. Finally, we’ll check for genes which passed our criteria in more than 4 samples, as our smallest group is 4. We could also have used the function is_weakly_greater_than() which would test for equality ($\geq$) instead of strictly greater than ($>$).

dgeList %>%
    cpm() %>%
    is_greater_than(1) %>%
    rowSums() %>%
    is_greater_than(4) %>%
    table()

Losing about 1/3 of the genes is pretty common, so let’s now apply that to our dataset. The object genes2keep below will be a logical vector deciding on whether we keep the gene for downstream analysis, based purely on whether we consider the gene to be detectable. We’ll create a new DGEList object by subsetting our primary one using the [] trick we learned earlier. We are creating a new object here so that if we change our mind about our filtering strategy, we don’t have to rerun all the code above.

genes2keep <- dgeList %>%
    cpm() %>%
    is_greater_than(1) %>%
    rowSums() %>%
    is_greater_than(4)

dgeList_Filtered <- dgeList[genes2keep,] %>%
  calcNormFactors()

A note on normalisation

There are many different metrics used for count data normalisation including T ranscripts P er M illion mapped reads (TPM), C ounts P er M illion mapped reads (CPM) and R eads P er K ilobase per M illion mapped reads (RPKM). We have already use CPM in our anlaysis. The Counts Per Million metric takes our raw counts and performs a simple normalisation to account for differences in library size between samples. Put simply, the cpm() function we use takes the raw/estimated counts and divides each count by the sample library size before multiplying each count by 10⁶ with the equation: $CPM_i = 10^6.(\frac{X_i}{N})$ Where $X_i$ is the gene in consideration and $N$ is the sample library size. Often in RNA-seq we use TPM which is based on a standard gene size compared to RPKM which takes gene length into account. Normalisation is a bigger topic than you might think at first and is (mostly) outside of the scope of our session today. For an accessible introduction to RNA-seq normalisation you can read this blog post from Rob Patro, or this one written by Harold Pimentel.

In our session today we will be using edgeRs normalisation strategy. edgeR does not use RPKM, TPM etc. to normalise but rather has its own calcNormFactors() function.

This is because edgeR needs to normalise for:

Sequencing depth (thats what RPKM etc. deal with)
Library composition (ie different samples contain different active genes)

The process edgeR uses for normalisation (simplified) is as follows:

Step 1: Remove all undetectable genes (remove genes with zero read counts in all samples).

Step 2: Pick one sample to be the “reference sample”.

this is the sample that edgeR will use to normalise all other samples against
what makes a good reference sample? edgeR avoids using extreme samples and instead attempts to identify the most average sample
to pick a reference sample edgeR will:
- 1. Scale each sample by its total read counts.
- 1. For each sample, determine the value such that 75% of the scaled data are equal to or smaller than it
- 1. Average the 75^th quantiles
- 1. The “reference” samples is the one who’s 75^th quantile most closely matches the average

Step 3: Select the genes for calculating the scaling factors. This is done separately for each sample relative to the reference sample.

the genes used for calculating the scaling factors may be different for each sample
edgeR filters out biased genes by looking at, and then excluding genes with large log fold differences

Step 4: Calculate the weighted average of the remaining log₂ ratios.

Step 5: Convert the weighted average of log₂ values to “normal numbers”.

Step 6: Centre the scaling factors around $1$.

the centering is done by dividing each raw scaling factor by their geometric mean

More information about edgeR normalisation can be found here and here.

Let’s compare the distributions of the two datasets (cbc and skm), using CPM on the log2 scale. In the following, the command par(mfrow = c(1,2)) is a base graphics approach and sets the plotting parameters to be a multi=feature layout, with 1 row and 2 columns. We set these parameters because the convenient function plotDensities() we are calling below uses base graphics, not ggplot2.

col <- brewer.pal(nrow(dgeList$samples), "Paired")
par(mfrow = c(1,2))
dgeList %>%
    cpm(log = TRUE) %>%
    plotDensities(legend = FALSE,
                  main = "A. Before Filtering",
                  col=col)
dgeList_Filtered %>%
    cpm(log = TRUE) %>%
    plotDensities(legend = FALSE,
                  main = "B. After Filtering",
                  col=col)

Comparison of logCPM distributions before (A) and after (B) filtering for undetectable genes. Values on the x-axis represent logCPM

par(mfrow = c(1,1))

Note the peak at the left in the first plot around zero. This is all of the genes with near-zero counts. Then note that this peak is missing the second plot, confirming that we have removed most of the undetectable genes. Ideally we would like to see that the peak around zero has been removed after filtering, and that the samples all have approximately the same distribution of counts. Keeping in mind that we have significantly subset our original fastq data during alignment earlier such that we only have genes from mouse chromosome 1.

Library Sizes

Next we should check our library sizes. It does appear that the two tissue types have a different profile of reads. There could be a number of reasons for seeing differences between groups including sequencing batch effects, differences in the RNA quality from different tissues or extractions, or lab operations being performed by different staff. This is not ideal but most modelling approaches will be able to handle this.

dgeList_Filtered$samples %>%
    ggplot(aes(group, lib.size, fill = group)) +
    geom_boxplot() +
    scale_y_continuous(labels = comma) +
    labs(x = "Tissue Type", y = "Library Size") +
    theme_bw()

Library Sizes after filtering for undetectable genes.

We’ve used he function comma from the scales package here to help us interpret the y-axis. For most vertebrate datasets we expect libraries of >20million reads, but as we’ve given you a significantly reduced dataset, these numbers are pretty acceptable.

Unsupervised clustering of samples

One of the first steps in exploring our gene expression data is that of unsupervised clustering in the form of a multi-dimensional scaling plot (MDS), Principal Component Analysis (PCA) or similar. By unsupervised here we mean to say that we haven’t told R how many groups we are expecting to see, or given any information about which samples we expect to see clustering closely together. Visually inspecting the data in this way allows us to start to understand the extent to which any differential expression can be detected before we have performed any statistical tests. It may also help us to identify any outliers in our samples, or perhaps a factor we haven’t yet considered that should be included when fitting our model later.

Multi-dimensional Scaling Plot (MDS)

An important first exploration of our sequencing data is the MDS plot.

The plotMDS() function from the limma package creates a temporary log fold change between pairs and plots samples on a two-dimensional scatter-plot so that distances on the plot approximate the typical $log_2$ fold changes between the samples.

A simplified example of how we calculate the logFC for our MDS plot.

For a given counts table eg.

	sample_1	sample_2	sample_3
Gene_1	3.0	0.25	2.8
Gene_2	2.9	0.8	1.0
Gene_3	2.2	1.0	1.5

….

We take our gene counts and calculate the logFC between each pair of genes

logFC Gene_1: sample_1 versus sample_2 $= log(\frac{3}{.25}) = 3.58$
logFC Gene_2: sample_1 versus sample_2 $= log(\frac{2.9}{.8}) = 1.86$
Continue calculating the logFC between each pair of genes

Lastly, take the average of the absolute value of the logFC among genes

NB: we take the absolute value so the negative log fold changes don’t cancel out the positive ones

We can now use these distances between samples to plot our MDS.

Thankfully we don’t need to perform all of these calculations by hand, or even write our own script to do so

plotMDS(cpm(dgeList_Filtered, log = TRUE))
title(main = "Multi Dimensional Scaling Plot \nMouse Chromosome 1")

MDS showing two clear groups in the data. limma::plotMDS

Here we see that the Leading logFC on dimension 1 appears to separate the samples by tissue type suggesting that as we set out to investigate - there is a difference in gene expression between cerebellar cortex (cbc) and skeletal muscle (skm).

WARNING: it is easy to observe the difference between our groups on the MDS plot and assign them directly to tissue type. The files we are using here are part of a much larger research effort and as such there is much information not being assessed. e.g.. were our samples all sequenced at the same time, on the same machine? The differences we observe could be due to something we call batch effects.

Batch effects can be defined as systematic non-biological variation between groups of samples (or batches) due to experimental artefacts. Batch effects are a complication of high-throughput studies which occur when measurements are affected by laboratory conditions. The topic of batch effects and how to account for them deserves its own session but we don’t have the time to delve into that here. If you are interested in reading more about batch effects and how to overcome them the rna-seqblog has a number of interesting articles discussing the topic.

We can use the plotMDS() function directly to create an MDS plot as we did above, or assign the data created to an object that we can pipe into ggplot for a presentation ready figure.

mds <- dgeList_Filtered %>%
  cpm(log = TRUE) %>%
  plotMDS()

yes

The following chunk wraps the plotting of our MDS plot inside plotly::ggplotly() which takes a ggplot object and makes it interactive. You may notice that label was included as a plotting aesthetic, but no labels were added as a layer in ggplot2. These will instead be added once the plot is made interactive (try hovering the mouse over one of the points on the plot to see what we mean by interactive). Interactive plots will only be interactive in the Viewer tab or when compiling an R Markdown document to an html format, and will remain static if rendering to an MS Word document or pdf.

plotly::ggplotly(
  as.data.frame(cbind(dim1=mds$x,dim2= mds$y)) %>%
        rownames_to_column("sample") %>%
        as_tibble() %>%
        left_join(rownames_to_column(dgeList_Filtered$samples, "sample")) %>%
        ggplot(aes(x = dim1, y = dim2,
                   colour = group,
                   label = sample)) +
    xlab("Leading logFC Dim 1") +
    ylab("Leading logFC Dim 2") +
        geom_point(size = 3) +
        theme_bw()
)

MDS showing two clear groups in the data

In the spirit of interactive plots available in R, run the following code chunk to create an interactive MDS plot as an HTML page using Glimma::glMDSPlot. If you set launch = TRUE this code chunk will open a new browser window containing an MDS plot in the left panel and a bar plot showing the proportion of variation explained by each dimension in the right panel.

dgeList_Filtered %>%
  cpm(log=TRUE) %>%
  glMDSPlot(labels=rownames(dgeList_Filtered$samples),
            groups=dgeList_Filtered$samples$group,
            launch=TRUE)

Tasks

Try clicking on the bars of the bar plot to change the pair of dimensions being plotted on the MDS.
What might account for the variation between dimension 2 and dimension 3?

PCA

Next we might choose to perform a Principal Component Analysis on our data, commonly abbreviated to PCA. This time, let’s take our CPM values and asses them on the log2 scale to make sure our PCA results are not heavily skewed by highly expressed genes. Our first step is to create a new object to hold the PCA results. In the following code chunk we take our filtered DGEList object dgeList_Filtered and pipe it to the cpm() function we used earlier to calculate the CPM values for each gene prior to setting the filtering threshold. We have made one small change here by including $log - TRUE$ in the function call. By default this is $log_2$. Next we pipe the $log2$CPM values into the t() function. This will transpose the matrix of counts such that the matrix now has genes in the columns and samples in the rows, which we then pipe into prcomp from the stats package. prcomp() returns a list with class “prcomp” containing a number of useful components.

To find out more about prcomp try typing help("prcomp") into the console. This will open the R Documentation for this function in the Help tab.

pca <- dgeList_Filtered %>%
    cpm(log = TRUE) %>%
    t() %>%
    prcomp()

In our DGEList, we have the genes as the variables of interest for our main analysis, however for the PCA we’re looking at out samples as the variables of interest. The third line in the above code chunk has transposed (t()) the matrix returned by cpm(log = TRUE) to place the samples as the rows, which is where the function prcomp() expects to see the variables of interest.

Run the next code chunk for a quick inspection of the results shows that the first two components capture most of the variability, as expected. Beyond this observation, the details of PCA are beyond what we can cover here.

summary(pca)$importance %>%
  pander(split.tables = Inf)

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8
Standard deviation	43.5	9.545	6.458	6.267	5.206	4.883	3.982	1.581e-14
Proportion of Variance	0.8879	0.04274	0.01956	0.01842	0.01272	0.01119	0.00744	0
Cumulative Proportion	0.8879	0.9307	0.9502	0.9687	0.9814	0.9926	1	1

Just as we did with the MDS plot, we can plot our PCA results to see if samples group clearly with their tissue type based on our main two principal components (i.e. PC1 and PC2). Much the same as the MDS plot, any clear separation of our groups of interest can be considered a positive sign that we will find differentially expressed genes.

plotly::ggplotly(
    pca$x %>%
        as.data.frame() %>%
        rownames_to_column("sample") %>%
        as_tibble() %>%
        dplyr::select(sample, PC1, PC2) %>%
        left_join(rownames_to_column(dgeList_Filtered$samples, "sample")) %>%
        ggplot(aes(x = PC1, y = PC2,
                   colour = group,
                   label = sample)) +
        geom_point(size = 3) +
        theme_bw()
)

PCA showing two clear groups in the data

Multi-Dimensional Scaling can look similar to PCA, but asks a slightly different question of the data. PCA creates plots based on correlations between samples, MDS creates plots based on distances between samples (here logFC between genes). Both methods return the results: 1. Coordinates for a graph 2. Percent of variation each axis accounts for 3. Loading scores (to determine which variables have the largest effect)

It is common to perform one or both of these analyses before starting a differential expression analysis but rest assured that usually results found under both methods will reveal similar patterns in your data.

For a great tutorial on multi-dimensional scaling check out “StatQuest: MDS and PCoA” here, and a companion video about PCA titled “StatQuest: Principal Component Analysis (PCA), Step-by-Step” here.

Tasks

How much variation is accounted for in PC1 and PC2 combined?
Using the code chunk above as a template, plot PC3 and PC4 using ggplot and wrapping with plotly::ggplotly to investigate other sources of variation.

Differential Expression

In the previous sections we have worked with read counts, which are a discrete value and formally cannot be modelled using the assumption of normally distributed data (recall that counts data typically approximate a negative binomial distribution, not a normal distribution).

The failure to meet the assumption of normality rules out linear models and t-tests for RNA-Seq, so many packages have been developed which use the negative binomial distribution to model these counts (e.g. edgeR and DESeq2).

An alternative was proposed by Law et al., where they apply a system of weights to the counts which allow the assumption of normality to be applied. This method is called voom and we’ll use this today.

Voom transformation and calculation of variance weights

voom() is a function in the limma package that modifies RNA-Seq data for use with limma. First, lets set up a model matrix to specify the model to be fitted.

model_matrix <- model.matrix(~ + group, data = dgeList_Filtered$samples)

The above specifies a model where each coefficient corresponds to a group mean. Importantly we have used the samples element of the DGEList object to set the model matrix. By using the DGEList object we ensure that the samples (and therefore the group assignment) has been given in the same order as the counts table.

Tasks

Take a moment to inspect the model_matrix object. You can do this by simply typing the object name into the console and hitting the Enter key.
What information has been added here?
What could happen if the group assignment wasn’t given in the same order as the counts table?

Next we will create our voom object using the limma::voom() function.

voomData <- voom(dgeList_Filtered, model_matrix, plot = TRUE)

yes

Note that running the voom() function without the model_matrix object will add a design matrix to the data based on the group column of the DGEList object but which we chose to specify explicitly. This is something of a style choice, but either way we can now use the voom object to perform a simple linear regression on each gene, which amounts to a t-test in this dataset.

From here it’s a simple matter to code the analysis and inspect results.

A lot of analysis is performed in the following code chunk:

We’ll use lmFit() to fit the model for every gene using the design matrix in voomData. Note that we’re not using CPM, but are instead fitting the counts directly, after incorporation of the voom-derived weights.
Next we moderate the variances using an empirical Bayes approach. This uses the assumption that our variance estimates will be a mix of overestimates and underestimates, and shrinks them all towards a central value. As a strategy, this is widely accepted and has been shown to increase power and reduce false positives.
Finally, we’ll create a list of all genes with a summary of the results which we then convert to a tibble. The last step will remove the gene IDs (i.e. the row names), but because we’ve included our gene information in the DGEList object, this will be added to the results and we’ll still know which gene is which (now we see why we needed to ensure the genesGR object was added in the same order as the countData object).

topTable <- voomData %>% 
    lmFit() %>%
    eBayes %>%
    topTable(coef = "groupskm", n = Inf) %>%
    as_tibble()

When we call the coefficient groupskm in the topTable function we are using the column name to specify which coefficient or contrast of the linear model is of interest

Note that the GRanges information has been coerced into columns to form a data.frame/tibble and that looks a bit messy. To tidy things up, the following code will join all of the genomic co-ordinates into a single column, rename a few columns on the fly and by not asking for ID.width and ID.gene_biotype, we’ve saved ourselves dealing with a few redundant columns. We could actually add this to the chunk above to save forming then modifying an R object, but it’s been left separate for clarity.

Tasks

Take a quick look at the topTable object before and after the tidy up by typing topTable %>% head() into the console and hitting the Enter key.

NB: We have used a very handy dplyr::select() option here to rename the columns ID.gene_id and ID.gene_name.

topTable <- topTable %>%
    unite("Range", ID.start, ID.end, sep = "-") %>%
    unite("Location", ID.seqnames, Range, ID.strand, sep = ":") %>%
    dplyr::select(Geneid = ID.gene_id, 
                  Symbol = ID.gene_name,
                  AveExpr, logFC, t, P.Value, 
                  FDR = adj.P.Val, 
                  Location, 
                  Entrez = ID.entrezid)

Now that we have our ranked list of genes, we should really inspect these visually. A commonly used plot is a volcano plot, where the position of values on the x-axis is the logFC estimates for each gene, and the position on the y-axis relates to the strength of statistical significance for each gene. Before plotting, we added a simple column called DE to indicate whether we considered a gene to be differentially expressed (i.e.. the mean between the groups is different), based purely on an FDR-adjusted p-value < 0.05. To add some colour to the plot (and to make identifying the DE genes easy) we can use this DE status to colour points.

topTable %>%
    mutate(DE = FDR < 0.05) %>%
    ggplot(aes(logFC, -log10(P.Value), colour = DE)) +
    geom_point(alpha = 0.5) +
    geom_text_repel(data = . %>%
                        dplyr::filter(DE) %>%
                        dplyr::filter(-log10(P.Value) > 4 & abs(logFC) > 5),
                    aes(label = Symbol)) +
    scale_colour_manual(values = c("grey", "red")) +
    theme_bw() +
    theme(legend.position = "none")

Volcano plot showing DE genes between the two tissue types.

So what have we done here. Firstly we took our topTable object that contains the logFC values and FDR for each gene and piped the object into mutate. Here we have used mutate to create a new column called DE based on the logical question is the value in the FDR column less than 0.05?. For each row in our table if the value of the FDR column is < 0.05 a TRUE value will be returned, and of course a value of FALSE if $\geq$ 0.05. We have used geom_text_repel() to add gene names to the volcano plot - and we have used dplyr::filter() inside the call to add the gene names such that we will only add labels if the gene meets both a P.Value and logFC threshold. As another perspective, we might like to see whether these genes are at the high or low end of the range for expression values. This is often referred to as an MD plot, which stands for M ean expression vs D ifference, where difference is more commonly referred to as logFC. We’ve added an arrange() call in the following to ensure our DE genes are plotted last and are the most visible in our figure.

topTable %>%
    mutate(DE = FDR < 0.05) %>%
    arrange(desc(P.Value)) %>%
    ggplot(aes(AveExpr, logFC, colour = DE)) +
    geom_point(alpha = 0.5) +
    geom_text_repel(data = . %>%
                        dplyr::filter(DE) %>%
                        dplyr::filter(abs(logFC) > 5 & AveExpr > 8),
                    aes(label = Symbol)) +
    scale_colour_manual(values = c("grey", "red")) +
    labs(x = "Average Expression (log2 CPM)",
         y = "log Fold-Change") +
    theme_bw() +
    theme(legend.position = "none")

Mean-Difference plot showing fold-change potted against expression level. Genes considered as DE are highlighted in red.

If were happy with our results, we could export this list using write_csv() or write_tsv(). We could also produce a short table summarising the “big guns” in the dataset. As there are more than 100 genes considered as DE here, let’s restrict to those with logFC beyond the range $\pm2.3$, which equates to 5-fold up or down-regulation.

topTable %>%
    dplyr::filter(FDR < 0.05, abs(logFC) > 5) %>%
    dplyr::select(ID = Geneid, Symbol, AveExpr, logFC, P.Value, FDR) %>%
    pander(caption = paste("The", nrow(.), "most differentially expressed genes when ranked by p-value, and filtered on a logFC beyond the range $\\pm2.3$"))

The 97 most differentially expressed genes when ranked by p-value, and filtered on a logFC beyond the range $\pm2.3$
ID	Symbol	AveExpr	logFC	P.Value	FDR
ENSMUSG00000026208	Des	11.62	9.948	5.251e-13	5.849e-10
ENSMUSG00000015222	Map2	10.06	-6.978	6.332e-12	3.527e-09
ENSMUSG00000042751	Nmnat2	8.348	-6.849	4.473e-11	7.807e-09
ENSMUSG00000055567	Unc80	9.556	-7.245	9.264e-11	8.6e-09
ENSMUSG00000026407	Cacna1s	9.395	11.18	6.478e-11	8.019e-09
ENSMUSG00000014602	Kif1a	10.92	-7.154	2.819e-10	1.33e-08
ENSMUSG00000067653	Ankrd23	8.491	8.877	1.293e-10	1.037e-08
ENSMUSG00000025964	Adam23	9.513	-5.43	2.806e-10	1.33e-08
ENSMUSG00000026442	Nfasc	9.458	-6.686	2.413e-10	1.33e-08
ENSMUSG00000040723	Rcsd1	8.918	5.713	2.985e-10	1.33e-08
ENSMUSG00000079588	Tmem182	7.863	10.34	7.905e-11	8.438e-09
ENSMUSG00000048814	Lonrf2	8.134	-7.265	2.407e-10	1.33e-08
ENSMUSG00000026100	Mstn	7.204	7.413	2.336e-10	1.33e-08
ENSMUSG00000026080	Chst10	6.834	-6.432	3.815e-10	1.635e-08
ENSMUSG00000026443	Lrrn2	7.763	-5.845	8.927e-10	3.014e-08
ENSMUSG00000026424	Gpr37l1	8.847	-6.799	1.254e-09	3.581e-08
ENSMUSG00000042182	Bend6	8.552	-6.814	2.064e-09	5.186e-08
ENSMUSG00000004110	Cacna1e	7.52	-6.699	1.662e-09	4.41e-08
ENSMUSG00000028354	Fmn2	6.8	-6.646	1.515e-09	4.22e-08
ENSMUSG00000038576	Susd4	7.378	-6.272	2.338e-09	5.426e-08
ENSMUSG00000050711	Scg2	8.065	-6.872	2.962e-09	6.16e-08
ENSMUSG00000090071	Cdk5r2	6.864	-7.819	2.694e-09	6.002e-08
ENSMUSG00000026163	Sphkap	9.246	-7.221	5.351e-09	1.029e-07
ENSMUSG00000026204	Ptprn	8.457	-6.644	5.356e-09	1.029e-07
ENSMUSG00000045174	Amer3	5.608	-7.353	2.653e-09	6.002e-08
ENSMUSG00000026458	Ppfia4	8.719	-6.609	8.777e-09	1.509e-07
ENSMUSG00000044708	Kcnj10	9.078	-7.666	1.075e-08	1.687e-07
ENSMUSG00000026494	Kif26b	8.5	-5.186	1.459e-08	2.065e-07
ENSMUSG00000103653	Gstp-ps	4.914	6.844	8.937e-09	1.509e-07
ENSMUSG00000054976	Nyap2	5.214	-6.566	9.576e-09	1.569e-07
ENSMUSG00000026587	Astn1	7.39	-7.425	1.45e-08	2.065e-07
ENSMUSG00000041670	Rims1	7.78	-7.823	1.632e-08	2.245e-07
ENSMUSG00000025931	Paqr8	9.067	-6.249	2.417e-08	3.025e-07
ENSMUSG00000049866	Arl4c	8.198	-5.635	2.52e-08	3.096e-07
ENSMUSG00000050777	Tmem37	7.741	7.69	2.685e-08	3.251e-07
ENSMUSG00000033740	St18	6.468	-8.297	2.967e-08	3.516e-07
ENSMUSG00000033061	Resp18	6.411	-6.652	4.624e-08	4.906e-07
ENSMUSG00000058248	Kcnh1	6.888	-6.55	5.76e-08	5.484e-07
ENSMUSG00000026459	Myog	4.291	7.121	4.476e-08	4.841e-07
ENSMUSG00000026188	Tmem169	4.46	-5.837	4.724e-08	4.91e-07
ENSMUSG00000025909	Sntg1	4.644	-6.214	5.085e-08	5.149e-07
ENSMUSG00000026574	Dpt	5.883	8.955	5.621e-08	5.398e-07
ENSMUSG00000103375	Gm37181	4.431	-5.79	5.393e-08	5.316e-07
ENSMUSG00000036815	Dpp10	7.336	-6.895	7.079e-08	6.21e-07
ENSMUSG00000070695	Cntnap5a	5.696	-6.351	6.361e-08	5.906e-07
ENSMUSG00000044835	Ankrd45	5.333	-6.449	6.182e-08	5.837e-07
ENSMUSG00000033569	Adgrb3	7.003	-6.858	7.989e-08	6.953e-07
ENSMUSG00000026387	Sctr	4.58	7.138	6.415e-08	5.906e-07
ENSMUSG00000047496	Rnf152	7.677	-5.746	9.116e-08	7.522e-07
ENSMUSG00000051703	Tmem198	5.383	-6.538	7.077e-08	6.21e-07
ENSMUSG00000053024	Cntn2	8.626	-7.908	9.236e-08	7.523e-07
ENSMUSG00000038026	Kcnj9	7.026	-7.469	8.462e-08	7.234e-07
ENSMUSG00000045515	Pou3f3	6.77	-6.507	8.744e-08	7.372e-07
ENSMUSG00000016200	Syt14	6.002	-5.473	8.507e-08	7.234e-07
ENSMUSG00000026308	Klhl30	7.155	10.06	8.801e-08	7.372e-07
ENSMUSG00000032908	Sgpp2	7.097	-7.49	9.252e-08	7.523e-07
ENSMUSG00000034000	Neu4	4.311	-5.543	8.142e-08	7.031e-07
ENSMUSG00000026347	Tmem163	6.129	-7.211	1.003e-07	7.922e-07
ENSMUSG00000026344	Lypd1	5.368	-5.454	9.959e-08	7.922e-07
ENSMUSG00000025927	Tfap2b	6.074	-7.315	1.114e-07	8.56e-07
ENSMUSG00000054203	Ifi205	4.318	7.176	1.124e-07	8.577e-07
ENSMUSG00000026312	Cdh7	6.097	-6.314	1.548e-07	1.117e-06
ENSMUSG00000026452	Syt2	9.329	-8.204	2.235e-07	1.465e-06
ENSMUSG00000026527	Rgs7	6.452	-5.916	2.091e-07	1.403e-06
ENSMUSG00000026564	Dusp27	6.778	8.369	2.156e-07	1.43e-06
ENSMUSG00000016179	Camk1g	5.747	-5.601	2.123e-07	1.417e-06
ENSMUSG00000037509	Arhgef4	7.776	-6.028	2.428e-07	1.56e-06
ENSMUSG00000097823	Gm16701	4.274	-5.472	1.977e-07	1.351e-06
ENSMUSG00000026018	Ica1l	5.735	-5.071	2.338e-07	1.523e-06
ENSMUSG00000061816	Myl1	9.023	15.45	2.566e-07	1.624e-06
ENSMUSG00000050840	Cdh20	5.432	-6.197	2.788e-07	1.745e-06
ENSMUSG00000100147	1700047M11Rik	5.609	-5.298	2.959e-07	1.831e-06
ENSMUSG00000035131	Brinp3	5.209	-5.951	2.86e-07	1.78e-06
ENSMUSG00000036766	Dner	8.811	-7.813	4.011e-07	2.306e-06
ENSMUSG00000045648	Vwc2l	4.106	-5.142	3.673e-07	2.176e-06
ENSMUSG00000041144	Dnah7b	4.616	-5.364	3.812e-07	2.235e-06
ENSMUSG00000067879	Vxn	5.115	-5.984	5.531e-07	2.991e-06
ENSMUSG00000050967	Creg2	5.695	-6.74	6.025e-07	3.211e-06
ENSMUSG00000007122	Casq1	10.11	11.18	7.955e-07	4.084e-06
ENSMUSG00000015843	Rxrg	7.276	5.709	8.986e-07	4.469e-06
ENSMUSG00000057715	A830018L16Rik	4.971	-6.082	1.015e-06	4.982e-06
ENSMUSG00000042671	Rgs8	9.099	-8.621	1.337e-06	6.284e-06
ENSMUSG00000026514	Cnih3	4.281	-5.499	1.326e-06	6.261e-06
ENSMUSG00000042451	Mybph	6.858	10.73	1.528e-06	7.033e-06
ENSMUSG00000026697	Myoc	6.487	7.088	1.744e-06	7.896e-06
ENSMUSG00000103329	Gm42492	4.391	-5.353	2.068e-06	9.142e-06
ENSMUSG00000040113	Mettl11b	4.672	6.456	2.777e-06	1.168e-05
ENSMUSG00000047343	Mettl21c	5.3	7.24	3.258e-06	1.325e-05
ENSMUSG00000051985	Igfn1	6.775	5.024	5.244e-06	2.001e-05
ENSMUSG00000026580	Selp	3.623	5.158	8.963e-06	3.253e-05
ENSMUSG00000092085	Gm17224	5.275	6.015	9.622e-06	3.456e-05
ENSMUSG00000039323	Igfbp2	6.235	-5.506	3.058e-05	9.332e-05
ENSMUSG00000073490	Ifi207	5.422	6.554	3.337e-05	0.0001007
ENSMUSG00000015829	Tnr	8.667	-5.409	2.697e-05	8.321e-05
ENSMUSG00000006542	Prkag3	8.362	5.342	5.042e-05	0.000144
ENSMUSG00000019230	Lhx9	4.108	-5.223	0.001683	0.003341
ENSMUSG00000026628	Atf3	7.197	5.508	0.01451	0.02353

We can also take a look at our genes of interest one at a time. Because we have two groups (i.e.. our two tissue types) a simple box plot provides a good summary of our count data.

Here we take a gene of interest “ENSMUSG00000033021”, and produce a box plot of expression values split by tissue type. I have considered “ENSMUSG00000033021” to be interesting because it is both significantly differentially expressed, and highly expressed in the table we created above.

plotly::ggplotly(
dgeList_Filtered$counts %>%
  as.data.frame() %>%
  tibble::rownames_to_column() %>%
  dplyr::filter(., rowname %in% "ENSMUSG00000007122") %>%
  tibble::column_to_rownames() %>%
  t() %>%
  as.data.frame() %>%
  tibble::rownames_to_column("samplename") %>%
  left_join(., sampleGroups, by = "samplename") %>%
  ggplot() +
  geom_boxplot(aes(x = group, y = ENSMUSG00000007122,
                   fill = group)) +
  theme_bw()
)

Box plot showing mean differences in our gene of interest between tissue types.

Tasks

Use the table we created above to choose another gene of interest (or two or three) and create your own box plots based on the code chunk above.
Do the box plots you create make sense in the context of the comparison we set for the differential expression analysis?

Spring Into Bioinformatics

Differential expression

Melanie Smith

3rd October, 2019

Session Themes and Outcomes

Differential expression

Data Setup

Import

Regular Expressions

Create a DGE List

Add Gene Information

Data QC

Undetectable genes

A note on normalisation

Library Sizes

Unsupervised clustering of samples

Multi-dimensional Scaling Plot (MDS)

Tasks

PCA

Tasks

Differential Expression

Voom transformation and calculation of variance weights

Tasks

Tasks

Tasks