Now that we have our data in a nicely formatted file, we can move to R and follow a fairly standard workflow. Along the way, we’ll come across a few useful packages, data structures and coding tricks which will be applicable in many other contexts.

The packages we’ll use for this include not only the tidyverse, but a few other packages which are hosted on Bioconductor, such as limma, edgeR and AnnotationHub. We’ll also add the magrittr,pander and scales packages as they contain some useful additional utilities.

library(limma)
library(edgeR)
library(AnnotationHub)
library(tidyverse)
library(magrittr)
library(scales)
library(pander)
library(ggrepel)

Data Setup

Import

First we should import the file we’ve created using featureCounts.

counts <- read_tsv("../2_alignedData/counts/genes.out")

This file has the gene identifiers in the first column, with the remaining columns containing the name of the bam file and the number of reads aligning to each gene. The first thing we might like to do is tidy up those column names.

colnames(counts) <- colnames(counts) %>%
    basename() %>%
    str_remove("_(10|23)_(03|04)_(2014|2016)_S[0-9]_fem_Aligned.+")

That looks much cleaner and we haven’t lost any important information.

Create a DGE List

The main object type we like to use for differential gene expression is a DGEList, which stands for Digital Gene Expression List. These objects have two mandatory elements, with the first being our counts and the second being our samples. In these objects, we can consider the gene IDs to be the row names and the sample names are the column names. Here’s one way to create a DGEList.

dgeList <- counts %>%
    as.data.frame() %>%
    column_to_rownames("Geneid") %>%
    DGEList() %>%
    calcNormFactors()

In the first 3 lines, we’re just setting the gene IDs as the row names instead of being a column. From there we’ve turned the remaining columns (i.e. the counts) into a DGEList object, and calculated the normalisation factors, which can be seen in dgeList$samples If fitting counts using a negative binomial model (for discrete data), these norm.factors are included in the model to adjust for variations in the library size and count distributions. We’ll use a different approach for our analysis today, but doing this as we form the object is still good practice, in case we change tack halfway through our planned analysis.

dgeList$samples

In the samples element, we can also set the group variable so let’s put our time-points here.

dgeList$samples$group <- colnames(dgeList) %>%
    str_extract("(6|24)mth") %>%
    factor(levels = c("6mth", "24mth"))

Add Gene Information

A common, but optional element that we can include in a DGEList is one called genes. This is where we can place information such as the genomic location, the gene symbol and anything else we think is going to be relevant. First, we need to find this information though, and we’ll use the package AnnotationHub which contains the metadata for all of the annotation packages included in Bioconductor.

ah <- AnnotationHub()
ah

This structure is quite advanced, but if you just type in the object name, you’ll get an informative summary of all available annotation packages. Note that we have numerous data providers, species and data classes. The column on the left is also a shorthand ID we can use to retrieve any of these annotation objects.

We can also subset the Annotation Hub object to help us find what we’re looking for. In the following code, we’re restricting our search to zebrafish, with data held sourced from Ensembl. The final line looks for an object of class EnsDb which is a database object containing a large amount of information, but in a relatively user-friendly way.

ah %>%
    subset(species == "Danio rerio") %>%
    subset(dataprovider == "Ensembl") %>%
    subset(rdataclass == "EnsDb")

The last of these objects contains data from release 94, which will be close enough to match the data we have generated so far (which was Ensembl 95). Ideally, these should be the same, but there is no EnsDb object for Ensembl 95. I’ sure one will be made available in the next release of Bioconductor. Let’s use this annotation object and define it as the object ensDb

ensDb <- ah[["AH64906"]]
ensDb

There are many helper functions for extracting data from this package, such as transcripts(), promoters() and genes(). We want gene information today, so let’s just use that function. Note that after we obtain the data, we’re using subset() to only keep the data from chromosome 2.

genesGR <- genes(ensDb) %>%
    subset(seqnames == 2)

This object is a GenomicRanges (or GRanges) object and these are the brad & butter of genomic analysis in R. We could spend hours just looking at these, but the main point is that on the left of the “pipes” we have the chromosome (seqnames) followed by the range of bases the genes is contained within and the strand the gene is located on. This is the core GRanges element, and we could simply return this information using the function granges()

granges(genesGR)

Underlying each GRanges object is a seqinfo object and these contain all of the genomic information about chromosome names and lengths. If comparing two GRanges objects, they must have identical seqinfo objects otherwise the comparison will return an error. This actually makes perfect sense, but can create issues when comparing data from UCSC, NCBI and Ensembl as they all use different formats for their chromosome names, even though they’re based on the same assemblies.

seqinfo(genesGR)

To the right of the pipes, we have the metadata columns, accessed using the function mcols(). Notice this returns a DataFrame which is a slightly more controlled version of a data.frame. The differences are beyond the scope of this course, but they can easily be coerced to a data.frame using as.data.frame. If you want to use the dplyr functions on them, you will need to go through this step.

mcols(genesGR)

When we place this information into our DGEList object, some of those columns look a bit redundant, so let’s just keep four of the most useful ones.

mcols(genesGR) <- mcols(genesGR)[c("gene_id", "gene_name", "gene_biotype", "entrezid")]

If we wish to add this to our DGEList, note that the order of genes will be completely different. To fix this, we just use the rownames of our DGEList to reorder the genes object. We can also just create the new element $genes by simply typing it & providing a value. R will not check that these two objects are compatible, so you have to be on your toes here.

dgeList$genes <- genesGR[rownames(dgeList),]

Now when we subset our DGEList by gene, the genes element will also be subset accordingly and the initial relationships will be maintained.

dgeList[1:4,]

Data QC

Undetectable genes

As you may have noticed, no reads aligned to the 4th gene in this dataset so we should really remove it from the data. There are probably many more genes in this boat too. Let’s do a logical test to see how many genes were not detected in our dataset. First we’ll add up the total counts for every gene and see how many received at least one count.

dgeList$counts %>% 
    rowSums() %>%
    is_greater_than(0) %>%
    table()

Clearly, a good proportion of our genes were not expressed in our original samples. A common approach would be to remove undetectable genes using some metric, such as Counts per Million reads, known as cpm. We could consider a gene detectable if returning more than 1CPM in every sample from one of the treatment groups. Although our dataset is small (all libraries are < 1e6 reads), we usually deal with libraries between 20-30million reads, and this would equate to 20-30 reads aligning to a gene, in every sample from a treatment group. Here our smallest group is 3 so let’s see what would happen if we applied that filter.

First we’ll calculate the cpm for each gene in each sample, and then we’ll apply a logical test to each value checking whether it’s greater than 1. Next, we’ll add these up for each gene (i.e. row) and this will give us the total number of samples for each gene, that passed our criteria of \(cpm > 1\). Finally, we’ll check for genes which passed our criteria in more than 3 samples, as our smallest group is 3. We could also have used is_weakly_greater_than() which would test for equality (\(\geq\)) instead of strictly greater than (\(>\)).

dgeList %>%
    cpm() %>%
    is_greater_than(1) %>%
    rowSums() %>%
    is_greater_than(3) %>%
    table()

Losing about 1/3 of the genes is pretty common, so let’s now apply that to our dataset. The object genes2keep below will be a logical vector deciding on whether we keep the gene for downstream analysis, based purely on whether we consider the gene to be detectable. We’ll create a new DGEList object by subsetting our primary one. This way if we change our mind about our filtering strategy, we don’t have to rerun all the code above.

genes2keep <- dgeList %>%
    cpm() %>%
    is_greater_than(1) %>%
    rowSums() %>%
    is_greater_than(3)
dgeFilt <- dgeList[genes2keep,] %>% calcNormFactors()

Let’s compare the distributions of the two datasets, using cpm on the log2 scale. In the following the command par(mfrow = c(1,2)) is a base graphics approach and sets the plotting parameters to be a multi=feature layout, with 1 row and 2 columns. We’ve done this because the convenient function plotDensities() uses base graphics not ggplot2. It’s nowhere near as pretty.

par(mfrow = c(1,2))
dgeList %>%
    cpm(log = TRUE) %>%
    plotDensities(legend = FALSE, main = "A. Before Filtering")
dgeFilt %>%
    cpm(log = TRUE) %>%
    plotDensities(legend = FALSE, main = "B. After Filtering")
Comparison of logCPM distributions before and after filtering for undetectable genes. Values o the x-axis represent logCPM

Comparison of logCPM distributions before and after filtering for undetectable genes. Values o the x-axis represent logCPM

par(mfrow = c(1,1))

Note the peak at the left in the first plot around zero. This is all of the genes with near-zero counts. Then note that this peak is missing the second plot, confirming that we have removed most of the undetectable genes.

Library Sizes

Next we should check our library sizes. It does appear that these being prepared on different days has given one of our groups more reads. This is not ideal but most modelling approaches will be able to handle this.

dgeFilt$samples %>%
    ggplot(aes(group, lib.size, fill = group)) +
    geom_boxplot() +
    scale_y_continuous(labels = comma) +
    labs(x = "Timepoint", y = "Library Size") +
    theme_bw() 
Library Sizes after filtering for undetectable genes.

Library Sizes after filtering for undetectable genes.

We’ve used he function comma from the scales package here to help us interpret the y-axis. For most vertebrate datasets, you’d expect >20million reads, but as we’ve given you a significantly reduced dataset, these numbers are pretty acceptable.

PCA

Next we might choose to perform a Principal Component Analysis on our data, commonly abbreviated to PCA. This time, let’s take our CPM values & asses them on the log2 scale to make sure our PCA results are not heavily skewed by highly expressed genes.

pca <- dgeFilt %>%
    cpm(log = TRUE) %>%
    t() %>%
    prcomp() 

In our DGEList, we have the genes as the variables of interest for our main analysis, however for the PCA we’re looking at out samples as the variables of interest. The third line in the above code chunk has transposed (t()) the matrix returned by cpm(log = TRUE) to place the samples as the rows, which is where the function prcomp() expects to see the variables of interest.

A quick inspection of the results shows that the first two components capture most of the variability, as expected. Beyond this observation, the details of PCA are beyond what we can cover here.

summary(pca)$importance %>% pander(split.tables = Inf)
  PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 10.88 8.17 5.864 5.683 5.197 4.838 1.44e-14
Proportion of Variance 0.3917 0.2209 0.1138 0.1069 0.08936 0.07746 0
Cumulative Proportion 0.3917 0.6125 0.7263 0.8332 0.9225 1 1

We can also plot our results to see if samples group clearly with the treatment group based on our main two principal components. Any clear separation can be considered a positive sign that we will find differentially expressed genes. The following chunk wraps the plotting inside plotly::ggplotly() which takes a ggplot object and makes it interactive. You may notice that label was included as a plotting aesthetic, but no labels were added as a layer in ggplot2. These will instead be added once the plot is made interactive. Interactive plots will only be interactive when compiling an RMarkdown document to an html format, and will remain static if rendering to an MS Word document or pdf.

plotly::ggplotly(
    pca$x %>%
        as.data.frame() %>%
        rownames_to_column("sample") %>%
        as_tibble() %>%
        dplyr::select(sample, PC1, PC2) %>%
        left_join(rownames_to_column(dgeFilt$samples, "sample")) %>%
        ggplot(aes(PC1, PC2, colour = group, label = sample)) +
        geom_point(size = 3) +
        theme_bw()
)

PCA showing two clear groups in the data

In many workflows you may see the function plotMDS() which produces a similar plot to the above. Multi-Dimensional Scaling can look similar to PCA, but asks a slightly different question of the data. Usually results found under both methods will reveal similar patterns in your data. plotMDS() doesn’t quite give you as pretty a picture as the above either.

Differential Expression

In the previous sections we have worked with read counts, which are a discrete value and formally cannot be modelled using the assumption of normally distributed data. This rules out linear models and t-tests for RNA-Seq, so many packages have been developed which use the negative binomial distribution to model these counts (e.g. edgeR and DESeq2). An alternative was proposed by Law et al, where they apply a system of weights to the counts which allow the assumption of normality to be applied. This method is called voom and we’ll use this today.

voomData <- voom(dgeFilt)

Note that this has added a design matrix to the data based on our groups column, and we can use this to perform a simple linear regression on each gene, which amounts to a t-test in this dataset. From here it’s a simple matter to code the analysis and inspect results.

A lot of analysis is performed in the following code chunk:

  1. We’ll use lmFit() to fit the model for every gene using the design matrix in voomData. Note that we’re not using CPM, but are instead fitting the counts directly, after incorporation of the voom-derived weights.
  2. Next we moderate the variances using an empirical Bayes approach. This uses the assumption that our variance estimates will a mix of overestimates and underestimates, and shrinks them all towards a central value. As a strategy, this is widely accepted and has been shown to increase power and reduce false positives.
  3. Finally, we’ll create a list of all genes with a summary of the results which we then convert to a tibble. The last step will remove the gene IDs (i.e. the row names), but because we’ve included our gene information in the DGEList object, this will be added to he results and we’ll still know which gene is which.
topTable <- voomData %>% 
    lmFit() %>%
    eBayes %>%
    topTable(coef = "group24mth", n = Inf) %>%
    as_tibble()

Note that the GRanges information has been coerced into columns to form a data.frame/tibble and that looks a bit messy. To tidy things up, the following code will join all of the genomic co-ordinates into a single column, rename a few columns on the fly and by not asking for ID.width & ID.gene_biotype, we’ve saved ourselves dealing with a few redundant columns. We could actually add this to the chunk above to save forming then modifying an R object, but it’s been left separate for clarity.

topTable <- topTable %>%
    unite("Range", ID.start, ID.end, sep = "-") %>%
    unite("Location", ID.seqnames, Range, ID.strand, sep = ":") %>%
    dplyr::select(Geneid = ID.gene_id, 
                  Symbol = ID.gene_name,
                  AveExpr, logFC, t, P.Value, 
                  FDR = adj.P.Val, 
                  Location, 
                  Entrez = ID.entrezid)

Now that we have our ranked list of genes, we should really inspect these visually. A commonly used plot is a volcano plot, where we place the logFC estimates on the x-axis, and the position on the y-axis relates to the strength of statistical significance. Before plotting, we added a simple column called DE to indicate whether we considered a gene to be DE, based purely on an FDR-adjusted p-value < 0.05. We’ll use this to colour points.

topTable %>%
    mutate(DE = FDR < 0.05) %>%
    ggplot(aes(logFC, -log10(P.Value), colour = DE)) +
    geom_point(alpha = 0.5) +
    geom_text_repel(data = . %>% 
                        dplyr::filter(DE) %>%
                        dplyr::filter(-log10(P.Value) > 4 | abs(logFC) > 2.5),
                    aes(label = Symbol)) + 
    scale_colour_manual(values = c("grey", "red")) +
    theme_bw() +
    theme(legend.position = "none")
Volcano plot showing DE genes between the two timepoints

Volcano plot showing DE genes between the two timepoints

As another perspective, we might like to see whether these genes are at the high or low end of the range for expression values. This is often referred to as an MD plot, which stands for Mean expression vs Difference, where difference is more commonly referred to as logFC. We’ve added an arrange() call in the following to ensure our DE genes are plotted last and are the most visible in our figure.

topTable %>%
    mutate(DE = FDR < 0.05) %>%
    arrange(desc(P.Value)) %>%
    ggplot(aes(AveExpr, logFC, colour = DE)) +
    geom_point(alpha = 0.5) +
    geom_text_repel(data = . %>% 
                        dplyr::filter(DE) %>%
                        dplyr::filter(abs(logFC) > 2 | AveExpr > 14),
                    aes(label = Symbol)) + 
    scale_colour_manual(values = c("grey", "red")) +
    labs(x = "Average Expression (log2 CPM)",
         y = "log Fold-Change") +
    theme_bw() +
    theme(legend.position = "none")
Mean-Difference plot showing fold-change potted against expression level. Genes considered as DE are highlighted in red.

Mean-Difference plot showing fold-change potted against expression level. Genes considered as DE are highlighted in red.

If were happy with our results, we could exprt this list using write_csv() or write_tsv(). We could also produce a short table summarising the “big guns” in the dataset. As there are more than 100 genes considered as DE here, let’s restrict to those with logFC beyond the range \(\pm1\), which equates to 2-fold up or down-regulation.

topTable %>%
    dplyr::filter(FDR < 0.05, abs(logFC) > 1) %>%
    dplyr::select(ID = Geneid, Symbol, AveExpr, logFC, P.Value, FDR) %>%
    pander(caption = paste("The", nrow(.), "most DE genes when ranked by p-value, and filtered on a logFC beyond the range $\\pm1$"))
The 25 most DE genes when ranked by p-value, and filtered on a logFC beyond the range \(\pm1\)
ID Symbol AveExpr logFC P.Value FDR
ENSDARG00000017780 rorcb 10.06 -2.07 1.708e-05 0.009733
ENSDARG00000103518 abhd8a 8.493 -1.068 0.000239 0.01881
ENSDARG00000103770 ARMC6 7.279 -1.166 0.0002948 0.01977
ENSDARG00000085497 RF00009 12.45 2.656 0.0004169 0.0226
ENSDARG00000092225 si:dkeyp-13a3.10 6.77 1.154 0.0005436 0.0226
ENSDARG00000083986 RF01241 5.086 -2.678 0.000226 0.01881
ENSDARG00000013576 gadd45bb 6.786 1.217 0.0006081 0.0226
ENSDARG00000009014 col11a1b 6.503 1.546 0.0006446 0.0226
ENSDARG00000038669 gbp2 6.709 1.26 0.0008858 0.02463
ENSDARG00000074604 phc3 6.026 1.096 0.0008169 0.02388
ENSDARG00000054292 si:ch211-14a17.11 6.255 -1.214 0.001542 0.02966
ENSDARG00000056490 zgc:110158 9.763 -1.348 0.001592 0.02966
ENSDARG00000093774 rbp2b 4.08 -3.109 0.0006403 0.0226
ENSDARG00000054837 zgc:136870 4.634 1.538 0.001398 0.02966
ENSDARG00000039517 c8b 6.156 1.133 0.002601 0.03423
ENSDARG00000079589 si:dkeyp-73d8.6 5.234 2.824 0.003386 0.0386
ENSDARG00000074846 si:dkeyp-68b7.12 6.537 1.161 0.005174 0.04794
ENSDARG00000092780 si:ch1073-170o4.1 4.844 1.484 0.004274 0.0444
ENSDARG00000058695 ddr2b 4.954 1.145 0.005788 0.04937
ENSDARG00000080942 RF00572 2.686 2.053 0.005527 0.04884
ENSDARG00000093205 AL845362.1 1.478 -2.52 0.003694 0.04049
ENSDARG00000093060 BX682548.1 2.388 2.475 0.00494 0.04662
ENSDARG00000093203 ftr32 1.308 2.651 0.002639 0.03423
ENSDARG00000079030 ftr39p 0.5909 2.324 0.003327 0.03831
ENSDARG00000054202 hbl4 1.202 2.472 0.005846 0.04937

GO Enrichment

Before we perform this step, we need to install two missing packages to our VM, which are located on the Bioconductor repository. Note the slightly different installation method for these packages.

BiocManager::install(c("GO.db", "org.Hs.eg.db"))

In order to connect our genes to any larger biological behaviours, enrichments tests are commonly performed, with Gene Ontology (GO) terms being a common way to describe biology. The tool we’ll use today is the function goana() from the package limma which essentially performs Fisher’s Exact Test as an enrichment test. Internally it is framed as a hypergeometric distribution, which is equivalent.

GO annotations are generally more extensive for EntrezGene identifiers than for Ensembl identifiers, so goana() has been implemented to only accept EntrezGene IDs. The function also relies on specific databases for mapping gene IDs to GO terms, however this package does not exist for zebrafish. AS a result, we’ll map our Ensembl IDs to the human equivalent EntrezGene IDs, and we’ll assume that conservation of function across species holds, which is not unreasonable in many cases.

We’ve prepared this object for you, and it’s hosted as part of today’s material at https://uofabioinformaticshub.github.io/Intro-NGS-fib. However, it’s in the directory data and is named ens2Entrez.tsv. We can form this directly into a file.path, then pass it to read_tsv() as a url() do we can load directly from the remote repository without downloading the file.

ens2Entrez <- file.path("https://uofabioinformaticshub.github.io/Intro-NGS-fib", "data", "ens2Entrez.tsv") %>% 
    url() %>%
    read_tsv()

For goana() to analyse our data, we need to provide a list of DE genes, and then the larger body (i.e. universe) of genes from which they are drawn. In general, this should be the list of genes detected as expressed in your sample, not the complete genome. Ask a tutor if you don’t understand why.

de <- topTable %>%
    dplyr::filter(FDR < 0.05) %>%
    dplyr::select(Geneid) %>%
    left_join(ens2Entrez) %>%
    dplyr::filter(!is.na(Entrez)) %>%
    .[["Entrez"]] %>%
    unique()
uv <- topTable %>%
    dplyr::select(Geneid) %>%
    left_join(ens2Entrez) %>%
    dplyr::filter(!is.na(Entrez)) %>%
    .[["Entrez"]] %>%
    unique()

From here, we simply pass our two gene sets to goana() and a set of unsorted results will be returned without any adjustment of p-values for multiple comparisons.

goResults <- goana(de = de, universe = uv, species = "Hs")
head(goResults)

This will contain numerous GO terms not found in our set of DE genes. We can remove these from the data and in the following, we’re only considering GO terms with more than one gene mapping to it, to ensure we’re capturing biology across pathways, rather than just single genes with relatively unique functions. After filtering, we need to adjust our p-value, filter our results, then return a table to share with our collaborators.

goResults %>% 
    rownames_to_column("GO ID") %>%
    as_tibble() %>%
    dplyr::filter(DE > 1) %>%
    arrange(P.DE) %>%
    mutate(FDR = p.adjust(P.DE, "fdr")) %>%
    dplyr::filter(FDR < 0.05) %>%
    mutate(`GO ID` = str_replace(`GO ID`, ":", "\\\\:")) %>%
    pander(caption = "GO Terms potentially enriched in the set of differentially expressed genes")
GO Terms potentially enriched in the set of differentially expressed genes
GO ID Term Ont N DE P.DE FDR
GO:0016887 ATPase activity MF 27 13 2.84e-05 0.04405

NB: In the above, I added an escaping \ to the colon (:) in the GO ID. This is optional, but prevents markdown from trying to render the ID as a hyperlink when knitting to html.