Now we’ve made some informative plots, the next level of information we may like to include is some information about each gene, besides just a generic identifier. Our RNA-Seq data has been aligned to the genome and summarised to genes based on Ensembl release 96. This release uses the GRCm38 genome build as the reference and has a particular set of genes, based on the available information and gene models at the time of release. It’s not uncommon for gene identifiers to exist, or not exist in different Ensembl releases. In this section of the course, we’ll look at how we can work with annotations and different databases in R.
All of the packages we have come across so far have been general R packages, and these are hosted on a repository of packages known as CRAN. To install any of these packages, we can simply use the command install.packages("packageName").
A second repository of packages exists, known as Bioconductor and all of the packages hosted here are specific to working with biological data. There is one CRAN package (BiocManager) which provides us access to this repository. We’ve installed all of the required packages for you today, but the function (install("packageName)) provided by this package gives you access to packages both on CRAN and on Bioconductor. For many of us, it’s our default installation method.
Bioconductor itself hosts over 1700 packages, with four primary classification categories: Software, AnnotationData, ExperimentData and Workflow. These categories (known as BiocViews) are not mutually exclusive.
To start working with gene-level information, we first need to find the appropriate annotation package for our dataset. There is a “master” package (AnnotationHub) which contains the metadata for all annotation packages, and allows for simple interaction with all packages.
The first step will be to load this package, then form an R object which contains all of the metadata. This will automatically check with the the main repository to ensure that all metadata is up to date.
library(AnnotationHub)
ah <- AnnotationHub()
ah
Notice that we no longer have a spreadsheet-like object. When we print the contents of this object, the first few lines will start with the # symbol and is essentially a header providing us with useful information. Here we can see the various data sources ($dataprovider), species ($species) and the type of R object that is returned ($rdataclass). Just like with a data.frame, we can access each of these fields using the $ symbol, however, that will return a vector of length 45228 so may not be useful. All of the metadata is obtained using the function mcols(), which is a common function in Bioconductor packages, which returns a DataFrame of your metadata with our familiar column structure (hence, mcols()).
mcols(ah)
Most of this information is beyond the scope of today’s session, but you may have noticed that this time we returned a DataFrame instead of a tibble or data.frame. This was a version of the data.frame in use by the Bioconductor community long before Hadley Wickham emerged as a programmer, and this form has stayed as the primary form amongst Bioconductor packages. Again, it’s very similar to a data.frame just with slightly different wrapping paper on it than a tibble. However, it doesn’t play nicely with the dplyr set of functions, so every time we do come across one that we wish to use dplyr functions on, we need to convert to a generic data.frame using the function as.data.frame(). This is one of the perils of a large and fragmented open-source community with particular fragments having diverse requirements, for example, those using genomic information have completely different requirements to those analysing social media.
What we do need to know is that we can use the data in these columns to subset our ah object. As this is not strictly a spreadsheet-like structure, we’ll use the relatively general function subset() to find what we need. We were told earlier that this is a mouse dataset, aligned to Ensembl release 96. The object type we’re going to use from here is an EnsDb object, which is way of storing all of the relevant information from the Ensembl database for our specific genome and annotation release.
We can use the combined logical test incorporating the ampersand (&) to restrict our results to the elements of ah containing the relevant information.
subset(ah, rdataclass == "EnsDb" & species == "Mus musculus")
Here we have just printed the elements of ah which define an EnsDb object for Mus musculus, and you’ll see every Ensembl release here, going back to 87, which was the release at the time of the initial package being developed. To the left of the pipe symbol (|) you’ll see the relevant identifier for each annotation object, and to load this into our workspace we use the code below. If the package is not already located on your machine, it will be downloaded and installed.
Also note the use of double brackets in the following code In R, if we use single brackets to subset an object we return an object of the same structure as the main one, just a smaller version of it. If we use double brackets, we only return one element of the larger object, and it will have a different underlying structure. If we’d used a single bracket (ah["AH69210"]) we would have returned a smaller AnnotationHub object, whilst by using the double brackets, we’ve actually returned the EnsDb object at the heart of is. R objects can have rather complex structures sometimes, but this principle is consistent across all objects.
ensDb <- ah[["AH69210"]]
ensDb
Now we have our ensDb object, we can see that the genome build is as expected, and we have a few additional pieces of information such as the number of genes and transcripts. Today we’ll only be working at the gene-level as transcript-level analysis is still relatively poorly understood.
To obtain our genes, we can simply use the function genes() and this will return an object of class GRanges or GenomicRanges. This appears to be a spreadsheet-like structure at first glance, but under the hood there have been structural restrictions placed on it, such as the requirement for a defined genome.
genesGR <- genes(ensDb)
genesGR
In this object, our genomic ranges associated with each gene appear on the left of the pipe symbol, whilst the metadata (accessed using mcols()) is on the right of these. In the metadata columns, you’ll see the type of information we’re after. Let’s have a quick look through the object.
seqinfo(genesGR)
The defined genome which underlies the object can be accessed using the seqinfo() function. Here you’ll see all of the key information such as whether the chromosomes are 1, 2, 3 etc., or chr1, chr2, chr3 etc. We can also see our chromosome lengths and the genome build. The isCircular field is rarely used, but it does exist if required.
mcols(genesGR)
If we explore the metadata columns, it appear there is some redundant information here. Although this is not strictly required, we can restrict this information to fields we care about. Let’s keep the columns gene_id, gene_name, gene_biotype, description and entrezid.
cols2Keep <- c("gene_id", "gene_name", "gene_biotype", "description", "entrezid")
mcols(genesGR) <- mcols(genesGR)[,cols2Keep]
mcols(genesGR)
Note that here, we overwrote the mcols data, with a modified version of itself. This strategy is often used to change column names or rownames using the colnames() and rownames() functions.
Another useful trick might be to only keep the genes on chromosome 1, as we know that’s all we have in our object DGE_Results. Notice that the chromosome is part of our GRanges core, in a field called seqnames. We can use this to overwrite our original object.
genesGR <- subset(genesGR, seqnames == "1")
For those who feel uncomfortable just throwing away this information, remember it will only take one line of code to get the full GRanges object back. This is much more useful than trying Undo in Excel!
Also note that we didn’t change the underlying defined genome. If comparing two GRanges objects, the first step an any function will be to check these builds are compatible, so these has left us with a structure that could be compared against another GRanges object, such as one containing transcription factor binding sites or SNPs. We won’t actually do that today, but it’s good practice.
Now we have our gene-level information, the next challenge will be to add this to our DGE_Results object, and this is where Hadley’s programming paradigm conflicts slight with the Bioconductor approach.
The package dplyr is well-equipped for merging two data.frame (or tibble) objects, so a good strategy may be to convert our GRanges object into a data.frame Let’s try this and just have a look. Before we commit to anything, let’s set up a quick look using the magrittr (%>%).
genesGR %>%
as.data.frame() %>%
head()
Notice how the ranges which were on the left of the pipe in the original object have now been separated into individual columns. We also have rownames, which are replicated in the gene_id column. A big advantage of converting the GRanges object into a data.frame is that all of the tools available to us in dplyr become viable.
An example of this might be to summarise our gene_biotypes and see how many we have.
genesGR %>%
as.data.frame() %>%
group_by(gene_biotype) %>%
summarise(Total = n()) %>%
arrange(desc(Total))
The group_by() function is very useful and takes a data.frame as input. Unfortunately, the S4 class DataFrame doesn’t play well with dplyr so this is what underlies our need to convert to a data.frame.
Our primary motivation is to add all of the information from our GRanges object onto our DGE_Results object. Once again, dplyr provides a very useful function called left_join() which combines two data frames based on a common column. If you consider the first data frame to be the one on the left, this explains the terminology. The second data.frame is essentially mapped onto the first one. Alternatively, right_join would map the first data.frame onto the second one.
The common column in both our objects would be GeneID in DGE_Results, which contains the information matching gene_id in the GRanges object. If these were named identically left/right_join() would figure everything out for us automatically. Let’s see this in action.
genesGR %>%
as.data.frame() %>%
right_join(DGE_Results, by = c("gene_id" = "GeneID")) %>%
as_tibble()
Notice that our GRanges object had length(genesGR) genes, whilst now have 1257 genes, as expected.
Alternatively, we could have used the approach
DGE_Results %>%
left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id"))
The only difference is in the column order, which we can fix using select(), and how easily you can understand the two code chunks.
DGE_Results %>%
left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
dplyr::select(GeneID, gene_name, logFC, logCPM, PValue, FDR, gene_biotype, description, entrezid)
In the previous chunk, you will have noticed that instead of just using select(), we used dplyr::select(). Unfortunately, dplyr is not the only package with a function called select, and similarly for the functions slice(), filter(), mutate(). If you type ?select in your R Console, the help page will turn up 4 options, and these are the packages currently loaded which have a function called select. By prefacing the function with the name of the package it comes from, and a double colon (i.e. dplyr::) we are telling R to use the function from that package (or more accurately referred to as a namespace). When Hadley wrote dplyr, he chose to use function names based on existing functions in the database language SQL. The other packages you see listed in the help page also use SQL-like syntax, so this is why they all have functions with the same name. If you see an inexplicable message from a function that usually works, this is a common way to fix it, i.e. call the help page using ?functionName and if you get more than one option, explicitly call the one you want using the namespace.
Our motivation for the previous section was to add gene names to the volcano plot. We can do this using geom_text(). However, before we just jump into that, pause for a moment and realise that if we do this, we will print the names for all 1257 genes. This will be an unreadable mess, so let’s introduce one more trick. Inside a piped chain of commands, the object being sent to the function can be called implicitly using the . symbol, and we’ll take advantage of this in the code chunk below. Pay particular attention to the line geom_text(data = . %>% dplyr::slice(1:10)).
DGE_Results %>%
left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
dplyr::select(GeneID, gene_name, logFC, logCPM, PValue, FDR) %>%
mutate(
Sig = FDR < 0.01 & abs(logFC) > 1,
Direction = case_when(
Sig & logFC > 0 ~ "Up",
Sig & logFC < 0 ~ "Down",
!Sig ~ "Unchanged"
)
) %>%
ggplot(aes(logFC, -log10(PValue), colour = Direction, label = gene_name)) +
geom_point() +
geom_text(data = . %>% dplyr::slice(1:10)) +
scale_colour_manual(values = c("blue", "grey", "red")) +
labs(y = "-log10(p)") +
theme_bw()
Unfortunately, those names seem to lie on top of each other, so we’re going to load another package ggrepel and use the function geom_text_repel(). This is a straight drop-in for geom_text(), but there is some fancy work going on behind the scenes which avoids overlapping labels.
First, place this in your loadPackages chunk.
library(ggrepel)
Now we can swap in the new function. I’ve also added the argument show.legend = FALSE to hide the labels from the legend.
DGE_Results %>%
left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
dplyr::select(GeneID, gene_name, logFC, logCPM, PValue, FDR) %>%
mutate(
Sig = FDR < 0.01 & abs(logFC) > 1,
Direction = case_when(
Sig & logFC > 0 ~ "Up",
Sig & logFC < 0 ~ "Down",
!Sig ~ "Unchanged"
)
) %>%
ggplot(aes(logFC, -log10(PValue), colour = Direction, label = gene_name)) +
geom_point() +
geom_text_repel(data = . %>% dplyr::slice(1:10), show.legend = FALSE) +
scale_colour_manual(values = c("blue", "grey", "red")) +
labs(y = "-log10(p)") +
theme_bw()
This is now a pretty long code chunk, but hopefully every line makes sense as you read through it. Alternatively, we could’ve created a new object from our merge of DGE_Results with the genesGR data, however, keeping a clean workspace with only a few object can help us avoid some common pitfalls.
Try adding labels in the same manner to a selection of differentially expressed genes in your MD plot.
Another task we may be interested in is seeing which genes with particular functions are differentially expressed, using our fairly broad gene_biotype information. We’ll look at functional enrichment a little later, but this can also be an informative part of any analysis.
Lets repeat most of the above code, but adding the gene_biotype to the dplyr::select() line. What we’ll then do is use group_by() across both gene_biotype and Direction to count how many we have.
DGE_Results %>%
left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
dplyr::select(GeneID, gene_name, logFC, logCPM, PValue, FDR, gene_biotype) %>%
mutate(
Sig = FDR < 0.01 & abs(logFC) > 1,
Direction = case_when(
Sig & logFC > 0 ~ "Up",
Sig & logFC < 0 ~ "Down",
!Sig ~ "Unchanged"
)
) %>%
group_by(gene_biotype, Direction) %>%
tally()
We can now use geom_bar() to draw a bar plot. Once again, there are a few tricks here. geom_bar() will try to summarise values, so setting stat = "identity" tells it to use the exact numbers we’ve given it. The other more strange looking line position = position_dodge2(preserve = "single") is telling geom_bar() to preserve the width of a single bar. This is helpful when we have zero values for one or more categories, and ensure that we don’t have varying bar widths in these cases.
DGE_Results %>%
left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
dplyr::select(GeneID, gene_name, logFC, logCPM, PValue, FDR, gene_biotype) %>%
mutate(
Sig = FDR < 0.01 & abs(logFC) > 1,
Direction = case_when(
Sig & logFC > 0 ~ "Up",
Sig & logFC < 0 ~ "Down",
!Sig ~ "Unchanged"
)
) %>%
group_by(gene_biotype, Direction) %>%
tally() %>%
ggplot(aes(gene_biotype, n, fill = Direction)) +
geom_bar(
stat = "identity",
position = position_dodge2(preserve = "single")
) +
theme_bw()
A final useful trick might be to place the biotype on the y-axis, and the best method for this is to use the function coord_flip().
DGE_Results %>%
left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
dplyr::select(GeneID, gene_name, logFC, logCPM, PValue, FDR, gene_biotype) %>%
mutate(
Sig = FDR < 0.01 & abs(logFC) > 1,
Direction = case_when(
Sig & logFC > 0 ~ "Up",
Sig & logFC < 0 ~ "Down",
!Sig ~ "Unchanged"
)
) %>%
group_by(gene_biotype, Direction) %>%
tally() %>%
ggplot(aes(gene_biotype, n, fill = Direction)) +
geom_bar(
stat = "identity",
position = position_dodge2(preserve = "single")
) +
coord_flip() +
theme_bw()
If we try making the axis labels a bit nicer, we have to call them by the original settings of the aesthetics, not what we see on the plot. Try making nicer axis labels for this plot.
Remake the tables of top 10 up and down regulated genes, including the gene names.