Now that we have our data in a nicely formatted file, we can move to R
and follow a fairly standard workflow. Along the way, we’ll come across a few useful packages, data structures and coding tricks which will be applicable in many other contexts.
The packages we’ll use for this include not only the tidyverse
, but a few other packages which are hosted on Bioconductor, such as limma
, edgeR
and AnnotationHub
. We’ll also add the magrittr
,pander
and scales
packages as they contain some useful additional utilities.
library(limma)
library(edgeR)
library(AnnotationHub)
library(tidyverse)
library(magrittr)
library(scales)
library(pander)
library(ggrepel)
First we should import the file we’ve created using featureCounts
.
counts <- read_tsv("../2_alignedData/counts/genes.out")
This file has the gene identifiers in the first column, with the remaining columns containing the name of the bam
file and the number of reads aligning to each gene. The first thing we might like to do is tidy up those column names.
colnames(counts) <- colnames(counts) %>%
basename() %>%
str_remove("_(10|23)_(03|04)_(2014|2016)_S[0-9]_fem_Aligned.+")
That looks much cleaner and we haven’t lost any important information.
The main object type we like to use for differential gene expression is a DGEList
, which stands for Digital Gene Expression List. These objects have two mandatory elements, with the first being our counts and the second being our samples. In these objects, we can consider the gene IDs to be the row names and the sample names are the column names. Here’s one way to create a DGEList
.
dgeList <- counts %>%
as.data.frame() %>%
column_to_rownames("Geneid") %>%
DGEList() %>%
calcNormFactors()
In the first 3 lines, we’re just setting the gene IDs as the row names instead of being a column. From there we’ve turned the remaining columns (i.e. the counts) into a DGEList
object, and calculated the normalisation factors, which can be seen in dgeList$samples
If fitting counts using a negative binomial model (for discrete data), these norm.factors
are included in the model to adjust for variations in the library size and count distributions. We’ll use a different approach for our analysis today, but doing this as we form the object is still good practice, in case we change tack halfway through our planned analysis.
dgeList$samples
In the samples
element, we can also set the group variable so let’s put our time-points here.
dgeList$samples$group <- colnames(dgeList) %>%
str_extract("(6|24)mth") %>%
factor(levels = c("6mth", "24mth"))
A common, but optional element that we can include in a DGEList
is one called genes
. This is where we can place information such as the genomic location, the gene symbol and anything else we think is going to be relevant. First, we need to find this information though, and we’ll use the package AnnotationHub
which contains the metadata for all of the annotation packages included in Bioconductor.
ah <- AnnotationHub()
ah
This structure is quite advanced, but if you just type in the object name, you’ll get an informative summary of all available annotation packages. Note that we have numerous data providers, species and data classes. The column on the left is also a shorthand ID we can use to retrieve any of these annotation objects.
We can also subset the Annotation Hub object to help us find what we’re looking for. In the following code, we’re restricting our search to zebrafish, with data held sourced from Ensembl. The final line looks for an object of class EnsDb
which is a database object containing a large amount of information, but in a relatively user-friendly way.
ah %>%
subset(species == "Danio rerio") %>%
subset(dataprovider == "Ensembl") %>%
subset(rdataclass == "EnsDb")
The last of these objects contains data from release 94, which will be close enough to match the data we have generated so far (which was Ensembl 95). Ideally, these should be the same, but there is no EnsDb object for Ensembl 95. I’ sure one will be made available in the next release of Bioconductor. Let’s use this annotation object and define it as the object ensDb
ensDb <- ah[["AH64906"]]
ensDb
There are many helper functions for extracting data from this package, such as transcripts()
, promoters()
and genes()
. We want gene information today, so let’s just use that function. Note that after we obtain the data, we’re using subset()
to only keep the data from chromosome 2.
genesGR <- genes(ensDb) %>%
subset(seqnames == 2)
This object is a GenomicRanges
(or GRanges
) object and these are the brad & butter of genomic analysis in R. We could spend hours just looking at these, but the main point is that on the left of the “pipes” we have the chromosome (seqnames
) followed by the range of bases the genes is contained within and the strand the gene is located on. This is the core GRanges
element, and we could simply return this information using the function granges()
granges(genesGR)
Underlying each GRanges
object is a seqinfo
object and these contain all of the genomic information about chromosome names and lengths. If comparing two GRanges
objects, they must have identical seqinfo
objects otherwise the comparison will return an error. This actually makes perfect sense, but can create issues when comparing data from UCSC, NCBI and Ensembl as they all use different formats for their chromosome names, even though they’re based on the same assemblies.
seqinfo(genesGR)
To the right of the pipes, we have the metadata columns, accessed using the function mcols()
. Notice this returns a DataFrame
which is a slightly more controlled version of a data.frame
. The differences are beyond the scope of this course, but they can easily be coerced to a data.frame
using as.data.frame
. If you want to use the dplyr
functions on them, you will need to go through this step.
mcols(genesGR)
When we place this information into our DGEList
object, some of those columns look a bit redundant, so let’s just keep four of the most useful ones.
mcols(genesGR) <- mcols(genesGR)[c("gene_id", "gene_name", "gene_biotype", "entrezid")]
If we wish to add this to our DGEList
, note that the order of genes will be completely different. To fix this, we just use the rownames
of our DGEList
to reorder the genes
object. We can also just create the new element $genes
by simply typing it & providing a value. R
will not check that these two objects are compatible, so you have to be on your toes here.
dgeList$genes <- genesGR[rownames(dgeList),]
Now when we subset our DGEList
by gene, the genes
element will also be subset accordingly and the initial relationships will be maintained.
dgeList[1:4,]
As you may have noticed, no reads aligned to the 4th gene in this dataset so we should really remove it from the data. There are probably many more genes in this boat too. Let’s do a logical test to see how many genes were not detected in our dataset. First we’ll add up the total counts for every gene and see how many received at least one count.
dgeList$counts %>%
rowSums() %>%
is_greater_than(0) %>%
table()
Clearly, a good proportion of our genes were not expressed in our original samples. A common approach would be to remove undetectable genes using some metric, such as Counts per Million reads, known as cpm
. We could consider a gene detectable if returning more than 1CPM in every sample from one of the treatment groups. Although our dataset is small (all libraries are < 1e6 reads), we usually deal with libraries between 20-30million reads, and this would equate to 20-30 reads aligning to a gene, in every sample from a treatment group. Here our smallest group is 3 so let’s see what would happen if we applied that filter.
First we’ll calculate the cpm
for each gene in each sample, and then we’ll apply a logical test to each value checking whether it’s greater than 1. Next, we’ll add these up for each gene (i.e. row) and this will give us the total number of samples for each gene, that passed our criteria of \(cpm > 1\). Finally, we’ll check for genes which passed our criteria in more than 3 samples, as our smallest group is 3. We could also have used is_weakly_greater_than()
which would test for equality (\(\geq\)) instead of strictly greater than (\(>\)).
dgeList %>%
cpm() %>%
is_greater_than(1) %>%
rowSums() %>%
is_greater_than(3) %>%
table()
Losing about 1/3 of the genes is pretty common, so let’s now apply that to our dataset. The object genes2keep
below will be a logical
vector deciding on whether we keep the gene for downstream analysis, based purely on whether we consider the gene to be detectable. We’ll create a new DGEList
object by subsetting our primary one. This way if we change our mind about our filtering strategy, we don’t have to rerun all the code above.
genes2keep <- dgeList %>%
cpm() %>%
is_greater_than(1) %>%
rowSums() %>%
is_greater_than(3)
dgeFilt <- dgeList[genes2keep,] %>% calcNormFactors()
Let’s compare the distributions of the two datasets, using cpm on the log2 scale. In the following the command par(mfrow = c(1,2))
is a base graphics approach and sets the plotting par
ameters to be a m
ulti=f
eature layout, with 1 row and 2 columns. We’ve done this because the convenient function plotDensities()
uses base graphics not ggplot2
. It’s nowhere near as pretty.
par(mfrow = c(1,2))
dgeList %>%
cpm(log = TRUE) %>%
plotDensities(legend = FALSE, main = "A. Before Filtering")
dgeFilt %>%
cpm(log = TRUE) %>%
plotDensities(legend = FALSE, main = "B. After Filtering")
par(mfrow = c(1,1))
Note the peak at the left in the first plot around zero. This is all of the genes with near-zero counts. Then note that this peak is missing the second plot, confirming that we have removed most of the undetectable genes.
Next we should check our library sizes. It does appear that these being prepared on different days has given one of our groups more reads. This is not ideal but most modelling approaches will be able to handle this.
dgeFilt$samples %>%
ggplot(aes(group, lib.size, fill = group)) +
geom_boxplot() +
scale_y_continuous(labels = comma) +
labs(x = "Timepoint", y = "Library Size") +
theme_bw()
We’ve used he function comma
from the scales
package here to help us interpret the y-axis. For most vertebrate datasets, you’d expect >20million reads, but as we’ve given you a significantly reduced dataset, these numbers are pretty acceptable.
Next we might choose to perform a Principal Component Analysis on our data, commonly abbreviated to PCA. This time, let’s take our CPM values & asses them on the log2 scale to make sure our PCA results are not heavily skewed by highly expressed genes.
pca <- dgeFilt %>%
cpm(log = TRUE) %>%
t() %>%
prcomp()
In our DGEList, we have the genes as the variables of interest for our main analysis, however for the PCA we’re looking at out samples as the variables of interest. The third line in the above code chunk has transposed (t()
) the matrix returned by cpm(log = TRUE)
to place the samples as the rows, which is where the function prcomp()
expects to see the variables of interest.
A quick inspection of the results shows that the first two components capture most of the variability, as expected. Beyond this observation, the details of PCA are beyond what we can cover here.
summary(pca)$importance %>% pander(split.tables = Inf)
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | |
---|---|---|---|---|---|---|---|
Standard deviation | 10.88 | 8.17 | 5.864 | 5.683 | 5.197 | 4.838 | 1.44e-14 |
Proportion of Variance | 0.3917 | 0.2209 | 0.1138 | 0.1069 | 0.08936 | 0.07746 | 0 |
Cumulative Proportion | 0.3917 | 0.6125 | 0.7263 | 0.8332 | 0.9225 | 1 | 1 |
We can also plot our results to see if samples group clearly with the treatment group based on our main two principal components. Any clear separation can be considered a positive sign that we will find differentially expressed genes. The following chunk wraps the plotting inside plotly::ggplotly()
which takes a ggplot
object and makes it interactive. You may notice that label
was included as a plotting aesthetic, but no labels were added as a layer in ggplot2. These will instead be added once the plot is made interactive. Interactive plots will only be interactive when compiling an RMarkdown document to an html format, and will remain static if rendering to an MS Word document or pdf.
plotly::ggplotly(
pca$x %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
as_tibble() %>%
dplyr::select(sample, PC1, PC2) %>%
left_join(rownames_to_column(dgeFilt$samples, "sample")) %>%
ggplot(aes(PC1, PC2, colour = group, label = sample)) +
geom_point(size = 3) +
theme_bw()
)
In many workflows you may see the function plotMDS()
which produces a similar plot to the above. Multi-Dimensional Scaling can look similar to PCA, but asks a slightly different question of the data. Usually results found under both methods will reveal similar patterns in your data. plotMDS()
doesn’t quite give you as pretty a picture as the above either.
In the previous sections we have worked with read counts, which are a discrete value and formally cannot be modelled using the assumption of normally distributed data. This rules out linear models and t-tests for RNA-Seq, so many packages have been developed which use the negative binomial distribution to model these counts (e.g. edgeR
and DESeq2
). An alternative was proposed by Law et al, where they apply a system of weights to the counts which allow the assumption of normality to be applied. This method is called voom
and we’ll use this today.
voomData <- voom(dgeFilt)
Note that this has added a design matrix to the data based on our groups
column, and we can use this to perform a simple linear regression on each gene, which amounts to a t-test in this dataset. From here it’s a simple matter to code the analysis and inspect results.
A lot of analysis is performed in the following code chunk:
lmFit()
to fit the model for every gene using the design matrix in voomData
. Note that we’re not using CPM, but are instead fitting the counts directly, after incorporation of the voom-derived weights.tibble
. The last step will remove the gene IDs (i.e. the row names), but because we’ve included our gene information in the DGEList
object, this will be added to he results and we’ll still know which gene is which.topTable <- voomData %>%
lmFit() %>%
eBayes %>%
topTable(coef = "group24mth", n = Inf) %>%
as_tibble()
Note that the GRanges information has been coerced into columns to form a data.frame
/tibble
and that looks a bit messy. To tidy things up, the following code will join all of the genomic co-ordinates into a single column, rename a few columns on the fly and by not asking for ID.width
& ID.gene_biotype
, we’ve saved ourselves dealing with a few redundant columns. We could actually add this to the chunk above to save forming then modifying an R
object, but it’s been left separate for clarity.
topTable <- topTable %>%
unite("Range", ID.start, ID.end, sep = "-") %>%
unite("Location", ID.seqnames, Range, ID.strand, sep = ":") %>%
dplyr::select(Geneid = ID.gene_id,
Symbol = ID.gene_name,
AveExpr, logFC, t, P.Value,
FDR = adj.P.Val,
Location,
Entrez = ID.entrezid)
Now that we have our ranked list of genes, we should really inspect these visually. A commonly used plot is a volcano plot, where we place the logFC estimates on the x-axis, and the position on the y-axis relates to the strength of statistical significance. Before plotting, we added a simple column called DE
to indicate whether we considered a gene to be DE, based purely on an FDR-adjusted p-value < 0.05. We’ll use this to colour points.
topTable %>%
mutate(DE = FDR < 0.05) %>%
ggplot(aes(logFC, -log10(P.Value), colour = DE)) +
geom_point(alpha = 0.5) +
geom_text_repel(data = . %>%
dplyr::filter(DE) %>%
dplyr::filter(-log10(P.Value) > 4 | abs(logFC) > 2.5),
aes(label = Symbol)) +
scale_colour_manual(values = c("grey", "red")) +
theme_bw() +
theme(legend.position = "none")
As another perspective, we might like to see whether these genes are at the high or low end of the range for expression values. This is often referred to as an MD plot, which stands for M
ean expression vs D
ifference, where difference is more commonly referred to as logFC. We’ve added an arrange()
call in the following to ensure our DE genes are plotted last and are the most visible in our figure.
topTable %>%
mutate(DE = FDR < 0.05) %>%
arrange(desc(P.Value)) %>%
ggplot(aes(AveExpr, logFC, colour = DE)) +
geom_point(alpha = 0.5) +
geom_text_repel(data = . %>%
dplyr::filter(DE) %>%
dplyr::filter(abs(logFC) > 2 | AveExpr > 14),
aes(label = Symbol)) +
scale_colour_manual(values = c("grey", "red")) +
labs(x = "Average Expression (log2 CPM)",
y = "log Fold-Change") +
theme_bw() +
theme(legend.position = "none")
If were happy with our results, we could exprt this list using write_csv()
or write_tsv()
. We could also produce a short table summarising the “big guns” in the dataset. As there are more than 100 genes considered as DE here, let’s restrict to those with logFC beyond the range \(\pm1\), which equates to 2-fold up or down-regulation.
topTable %>%
dplyr::filter(FDR < 0.05, abs(logFC) > 1) %>%
dplyr::select(ID = Geneid, Symbol, AveExpr, logFC, P.Value, FDR) %>%
pander(caption = paste("The", nrow(.), "most DE genes when ranked by p-value, and filtered on a logFC beyond the range $\\pm1$"))
ID | Symbol | AveExpr | logFC | P.Value | FDR |
---|---|---|---|---|---|
ENSDARG00000017780 | rorcb | 10.06 | -2.07 | 1.708e-05 | 0.009733 |
ENSDARG00000103518 | abhd8a | 8.493 | -1.068 | 0.000239 | 0.01881 |
ENSDARG00000103770 | ARMC6 | 7.279 | -1.166 | 0.0002948 | 0.01977 |
ENSDARG00000085497 | RF00009 | 12.45 | 2.656 | 0.0004169 | 0.0226 |
ENSDARG00000092225 | si:dkeyp-13a3.10 | 6.77 | 1.154 | 0.0005436 | 0.0226 |
ENSDARG00000083986 | RF01241 | 5.086 | -2.678 | 0.000226 | 0.01881 |
ENSDARG00000013576 | gadd45bb | 6.786 | 1.217 | 0.0006081 | 0.0226 |
ENSDARG00000009014 | col11a1b | 6.503 | 1.546 | 0.0006446 | 0.0226 |
ENSDARG00000038669 | gbp2 | 6.709 | 1.26 | 0.0008858 | 0.02463 |
ENSDARG00000074604 | phc3 | 6.026 | 1.096 | 0.0008169 | 0.02388 |
ENSDARG00000054292 | si:ch211-14a17.11 | 6.255 | -1.214 | 0.001542 | 0.02966 |
ENSDARG00000056490 | zgc:110158 | 9.763 | -1.348 | 0.001592 | 0.02966 |
ENSDARG00000093774 | rbp2b | 4.08 | -3.109 | 0.0006403 | 0.0226 |
ENSDARG00000054837 | zgc:136870 | 4.634 | 1.538 | 0.001398 | 0.02966 |
ENSDARG00000039517 | c8b | 6.156 | 1.133 | 0.002601 | 0.03423 |
ENSDARG00000079589 | si:dkeyp-73d8.6 | 5.234 | 2.824 | 0.003386 | 0.0386 |
ENSDARG00000074846 | si:dkeyp-68b7.12 | 6.537 | 1.161 | 0.005174 | 0.04794 |
ENSDARG00000092780 | si:ch1073-170o4.1 | 4.844 | 1.484 | 0.004274 | 0.0444 |
ENSDARG00000058695 | ddr2b | 4.954 | 1.145 | 0.005788 | 0.04937 |
ENSDARG00000080942 | RF00572 | 2.686 | 2.053 | 0.005527 | 0.04884 |
ENSDARG00000093205 | AL845362.1 | 1.478 | -2.52 | 0.003694 | 0.04049 |
ENSDARG00000093060 | BX682548.1 | 2.388 | 2.475 | 0.00494 | 0.04662 |
ENSDARG00000093203 | ftr32 | 1.308 | 2.651 | 0.002639 | 0.03423 |
ENSDARG00000079030 | ftr39p | 0.5909 | 2.324 | 0.003327 | 0.03831 |
ENSDARG00000054202 | hbl4 | 1.202 | 2.472 | 0.005846 | 0.04937 |
Before we perform this step, we need to install two missing packages to our VM, which are located on the Bioconductor repository. Note the slightly different installation method for these packages.
BiocManager::install(c("GO.db", "org.Hs.eg.db"))
In order to connect our genes to any larger biological behaviours, enrichments tests are commonly performed, with Gene Ontology (GO) terms being a common way to describe biology. The tool we’ll use today is the function goana()
from the package limma
which essentially performs Fisher’s Exact Test as an enrichment test. Internally it is framed as a hypergeometric distribution, which is equivalent.
GO annotations are generally more extensive for EntrezGene identifiers than for Ensembl identifiers, so goana()
has been implemented to only accept EntrezGene IDs. The function also relies on specific databases for mapping gene IDs to GO terms, however this package does not exist for zebrafish. AS a result, we’ll map our Ensembl IDs to the human equivalent EntrezGene IDs, and we’ll assume that conservation of function across species holds, which is not unreasonable in many cases.
We’ve prepared this object for you, and it’s hosted as part of today’s material at https://uofabioinformaticshub.github.io/Intro-NGS-fib. However, it’s in the directory data
and is named ens2Entrez.tsv
. We can form this directly into a file.path, then pass it to read_tsv()
as a url()
do we can load directly from the remote repository without downloading the file.
ens2Entrez <- file.path("https://uofabioinformaticshub.github.io/Intro-NGS-fib", "data", "ens2Entrez.tsv") %>%
url() %>%
read_tsv()
For goana()
to analyse our data, we need to provide a list of DE genes, and then the larger body (i.e. universe) of genes from which they are drawn. In general, this should be the list of genes detected as expressed in your sample, not the complete genome. Ask a tutor if you don’t understand why.
de <- topTable %>%
dplyr::filter(FDR < 0.05) %>%
dplyr::select(Geneid) %>%
left_join(ens2Entrez) %>%
dplyr::filter(!is.na(Entrez)) %>%
.[["Entrez"]] %>%
unique()
uv <- topTable %>%
dplyr::select(Geneid) %>%
left_join(ens2Entrez) %>%
dplyr::filter(!is.na(Entrez)) %>%
.[["Entrez"]] %>%
unique()
From here, we simply pass our two gene sets to goana()
and a set of unsorted results will be returned without any adjustment of p-values for multiple comparisons.
Ont
indicates whether the term belongs to the Biological Process, Cellular Component or Molecular Function ontologyN
counts how many genes in the “universe” mapped to each GO termDE
shows how many genes in the set of DE genes mapped to each GO term.P.DE
is the results of the enrichment test for each term, and is not adjusted for multiple comparisons.goResults <- goana(de = de, universe = uv, species = "Hs")
head(goResults)
This will contain numerous GO terms not found in our set of DE genes. We can remove these from the data and in the following, we’re only considering GO terms with more than one gene mapping to it, to ensure we’re capturing biology across pathways, rather than just single genes with relatively unique functions. After filtering, we need to adjust our p-value, filter our results, then return a table to share with our collaborators.
goResults %>%
rownames_to_column("GO ID") %>%
as_tibble() %>%
dplyr::filter(DE > 1) %>%
arrange(P.DE) %>%
mutate(FDR = p.adjust(P.DE, "fdr")) %>%
dplyr::filter(FDR < 0.05) %>%
mutate(`GO ID` = str_replace(`GO ID`, ":", "\\\\:")) %>%
pander(caption = "GO Terms potentially enriched in the set of differentially expressed genes")
GO ID | Term | Ont | N | DE | P.DE | FDR |
---|---|---|---|---|---|---|
GO:0016887 | ATPase activity | MF | 27 | 13 | 2.84e-05 | 0.04405 |
NB: In the above, I added an escaping \
to the colon (:
) in the GO ID. This is optional, but prevents markdown from trying to render the ID as a hyperlink when knitting to html.