Spring Into Bioinformatics

Introduction
GO Enrichment
- Finding the genes for each term.
- KEGG Enrichment
Conclusion
- The tidyverse

Introduction

Now that we’ve gone through our results a little, we should really dig a little deeper and explore which aspects of the cellular biology are impacted in our comparison. Here we are comparing two fairly dissimilar tissues, whereas in many of our experiments we are comparing a control and a treatment, or a handful of closely related cell types.

The most common approach for testing for biological signal is to use a technique known as enrichment testing to see whether our set of differentially expressed genes are enriched for any signal. Under this approach we use pre-defined gene sets and see if any of these appear to be over-represented in our set of DE genes, compared to the set of genes which are unchanged in our comparison. These gene sets to be tested can be taken from any source, or assembled in any manner, but we commonly use databases such as the Gene Ontology (GO) or Kyoto Encyclopaedia of Genes and Genomes (KEGG). Another common database is MSigDB, however this can be a little more challenging to work with as cross-species and cross-database mappings are usually required.

Today, we’ll simply look at using the hyper-geometric test for enrichment. This is also known as Fisher’s Exact Test and bears a close resemblance to a $\chi^2$ test. Without going into fine statistical details, we form our data into a 2x2 table as follows.

	DE Genes	Not DE Genes
In GeneSet	10	10
Not In GeneSet	90	990

We then look to see if the number of DE genes which belong to our gene set appears to be higher (or lower) than might be expected when comparing to the unchanged genes. In the above example, 1.00% of the unchanged genes are in our gene set, whilst 10.0% of our DE genes are in the gene set. We would probably consider this gene set to be enriched in our set of DE genes.

Although we are using DE genes as our framework here, this can also be applied to any two (or more) comparative groups, such as SNPs associated with differing environmental populations. Fisher’s Exact Test is also not restricted to 2x2 tables, but this is the most common form that we see.

GO Enrichment

As this is a common question we ask of our data, a function has already been written for wider use. This can be found in the Bioconductor package limma, which is one of the most heavily used packages of all time. The function we need in goana() so let’s first add the package to our loadPackages chunk.

library(limma)

Now we can check the help page using ?goana. Here we can see both the goana() and kegga() function documented. The first two are how the function is defined for objects of class MArrayLM, which we haven’t used for our analysis, As a result, the function we’ll need is the one listed under Default S3 method:.

The information we need to pass to this function is a list of identifiers from our DE genes (de) and the total list of genes from which these De genes are drawn (universe). We’ll also need to define the species, which in our case will be “Mm” for Mus musculus

Unfortunately, we need to provide our gene identifier to this function as the NCBI-based EntrezGene identifiers, and we have been using Ensembl identifiers. In general, the opinion of the Bioinformatics Hub is that Ensembl gene-models are far better defined than EntrezGene so we tend to use Ensembl identifiers by default. However, the mappings of gene identifiers to GO terms is handled better by the EntrezGene database, and similarly the databases built by the Bioconductor community are more rich for EntrezGene identifiers, most likely as most developers are US-based. Nothing in bioinformatics is ever easy.

Fortunately, we set up our genesGR object sensibly and you may notice that we have a column of EntrezGene identifiers. We can use these! As we noted earlier, we just need a set of identifiers which can be passed to the function for the de argument, whilst we need the entire set under analysis for the universe. Previously (in a step we’ll cover tomorrow) we removed all of our undetectable genes from our analysis, as we only want to analyse genes that are being expressed in on or more of our experimental groups. This will minimise our spurious or less meaningful results.

The simplest strategy here is to define these as R objects, as we need to do a few filtering steps for each.

uv <- genesGR %>%
    subset(gene_id %in% DGE_Results$GeneID) %>%
    subset(!is.na(entrezid)) %>%
    mcols() %>%
    .[["entrezid"]] %>%
    unlist() %>%
    unique()

This is a long code chunk, so let’s look at what we’ve done.

The first subset(gene_id %in% DGE_Results$GeneID) line ensured that we’ll only keep genes in our ‘universe’ that are our set of expressed genes (i.e. those in the complete set of DGE_Results).

The next subset(!is.na(entrezid)) ensured that any genes without EntrezGene ids were removed, as these are essentially uninformative. There are two useful points here. The test is.na() is used to test for missing (i.e. NA) values in the column entrezid, whilst the placing of an exclamation point beforehand (!) reversed the results, so that TRUE became FALSE and vice-versa.

After this second subset() operation, we just took our mcols() to get the metadata, then we used the double brackets just to extract the single column “entrezid”. The astute amongst you may have noticed that this column is in fact an R object called a list. Without going into too much detail, this enables the mapping of one Ensembl ID to one or more EntrezGene IDs as we can store more than one value for each element. By following our extraction ([[]]) with unlist(), this converted the complex structure into a simple vector. From there we applied the function unique() to ensure every ID only appears once.

To perform a similar operation and get the set of EntrezGene IDs from our set of DE genes, we could try the following. Note that this time we’ve set up our DE genes directly from the DGE_Results object first.

de <- DGE_Results %>%
    dplyr::filter(FDR < 0.01 & abs(logFC) > 1) %>%
    left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
    dplyr::filter(!is.na(entrezid)) %>%
    .[["entrezid"]] %>%
    unlist() %>%
    unique()

Now we have our two lists of gene IDs, we can just call the function goana(). Under the hood, this will handle all of the mappings from EntrezGene ID to GO terms using a database package called org.Mm.eg.db. We don’t need to know any more than that, but this is way our species is required, as this package name will be automatically loaded using our provided “Mm” abbreviation. This function has also been well written for speed.

goRes <- goana(de, uv, species = "Mm")

This has returned a standard data.frame (with rownames), and we like to use tibble objects. To make this a tibble we’ll need to shift those rownames to a column. There is a function for that in the package tibble, so let’s add this to our loadPackages chunk.

library(tibble)

Now we can just use the function rownames_to_column()

goRes <- goRes %>%
    rownames_to_column("goid") %>%
    as_tibble()
goRes

Here we have over 9000 terms that have been tested, returned in alpha-numeric order based on the GO ID. The column N tells how many genes in the uv object were mapped to the GO term, whilst the column DE tells us how many genes in our de object were mapped to the GO term. The column P.DE contains the two-sided $p$-value from Fisher’s Exact Test, representing an association between the GO term and the set of DE genes.

As we are testing for enrichment, we need a way of finding which $p$-values are from the detection of enrichment and which are from a lack of enrichment. In general, we rarely test for a lack of enrichment unless your articular biological question makes this relevant.

A good way to do this might be to find the proportion of genes from the universe which mapped to this term, and the proportion of genes from de which mapped to this term. If the second proportion is higher, we know our $p$-value is associated with enrichment.

An alternative approach, might be to find the proportion of genes from the universe, then multiply this by the number of DE genes to obtain an Expected value. If DE > Expected, we also have enrichment.

goRes %>%
    mutate(Expected = length(de) * N / length(uv))

We might also like to remove any GO terms not found in our set of DE genes, as we can only test for enrichment of terms found in our results.

goRes %>%
    mutate(Expected = length(de) * N / length(uv)) %>%
    dplyr::filter(DE > 0)

Next, we might like to arrange by $p$-value.

goRes %>%
    mutate(Expected = length(de) * N / length(uv)) %>%
    dplyr::filter(DE > 0) %>%
    arrange(P.DE)

Finally, we’d need to adjust those $p$-values to account for the fact that we have multiple hypothesis test being conducted in parallel.

goRes %>%
    mutate(Expected = length(de) * N / length(uv)) %>%
    dplyr::filter(DE > 0) %>%
    arrange(P.DE) %>%
    mutate(adjP = p.adjust(P.DE, "fdr"))

Surprising, there are no GO terms here that appear to be enriched amongst our set of DE genes! By restricting our analysis to genes on chromosome 1, we’ve certainly limited our power here, but let’s proceed anyway. Maybe a better approach would be to look at up-regulated and down-regulated genes individually. Let’s define a new set of DE genes for up-regulated

up <- DGE_Results %>%
    dplyr::filter(FDR < 0.01 & logFC > 1) %>%
    left_join(as.data.frame(genesGR), by = c("GeneID" = "gene_id")) %>%
    dplyr::filter(!is.na(entrezid)) %>%
    .[["entrezid"]] %>%
    unlist() %>%
    unique()

Now we can repeat our GO analysis, hut this time, we’ll do it all in one big chunk.

upGoRes <- goana(up, uv, species = "Mm") %>%
    rownames_to_column("goid") %>%
    as_tibble() %>%
    mutate(Expected = length(up) * N / length(uv)) %>%
    dplyr::filter(DE > 0) %>%
    arrange(P.DE) %>%
    mutate(adjP = p.adjust(P.DE, "fdr"))

Considering we are comparing muscle cells to brain cells, this makes a fair bit of sense. Let’s produce a table for our report.

upGoRes %>%
    dplyr::filter(
        adjP < 0.05,
        DE > Expected
    ) %>%
    mutate(
        DE = paste(DE, N, sep = "/") 
    ) %>%
    dplyr::select(goid, Term, Ont, Expected, DE, P.DE, adjP) %>%
    pander(
        caption = "Enriched GO terms in the set of up-regulated genes.",
        justify = "lllrrrr",
        style = "rmarkdown",
        split.tables = Inf
    )

Task

Repeat this for down-regulated genes

Finding the genes for each term.

Unfortunately, this is not a particularly well-designed feature in R. The easiest strategy for us, given what we know is to just find the genes for one GO term at a time. This is performed using the get() function, and a mapping object. This mapping object is loaded with the package org.Mm.eg.db. Let’s add this to our loadPackages code chunk.

library(org.Mm.eg.db)

This package comes with a mapping object org.Mm.egGO2ALLEGS and we can just ask for the IDs for each GO term. Taking one of our highly ranked terms gives us the EntrezGene IDs

get("GO:0061061", org.Mm.egGO2ALLEGS)

This is the complete set of IDs, but we can use this to restrict our results to the genes we are interested in.

genesGR %>%
  subset(entrezid %in% get("GO:0061061", org.Mm.egGO2ALLEGS)) %>%
  mcols() %>%
  as.data.frame() %>%
  left_join(DGE_Results, by = c("gene_id" = "GeneID")) %>%
  as_tibble() %>%
  dplyr::filter(FDR < 0.01, logFC > 1) %>%
  dplyr::select(gene_id, gene_name, logFC, logCPM, PValue, FDR)

KEGG Enrichment

Similarly to GO, analysis of gene sets defined by KEGG can be performed using the function kegga().

keggRes <- kegga(up, uv, species = "Mm")

We can use the same approach as for GO enrichment to produce a table of the 10 most enriched pathways

keggRes %>%
  rownames_to_column("keggid") %>%
  as_tibble() %>%
  dplyr::filter(DE > 0) %>%
  mutate(
    Expected = length(up) * N / length(uv),
    adjP = p.adjust(P.DE, "fdr")
  ) %>%
  arrange(P.DE) %>%
  dplyr::filter(DE > Expected) %>%
  dplyr::slice(1:10) %>%
  mutate(DE = paste(DE, N, sep = "/")) %>%
  dplyr::select(keggid, Pathway, Expected, DE, P.DE, adjP)

Task

Repeat this for down-regulated genes

Conclusion

We’ve covered a lot of material today, so take any remaining time to create an R Markdown document, describing everything you’re doing as if you’re communicating with your collaborators, or with a future version of you who can’t remember why you’ve made your decisions.

In this document:

Inspect the $p$-values and explain why
Create an MD plot, labelling the genes of your choice
Create a Volcano plot, labelling the genes of your choice
Create tables of up and down-regulated genes. Choose 10 or 20 or any number you prefer
Describe your criteria for considering a gene to be differentially expressed, and include the numbers of DE genes in your explanatory text
Perform an enrichment test, and create a table summarising your results

The tidyverse

The packages, dplyr, readr, ggplot2 and tibble along with two other packages we’ll learn about later (stringr and tidyr) form what is collectively known as the tidyverse. The package tidyverse is a simple wrapper which loads all of these in one line. In your loadPackages chunk, you can replace all of the above packages with the simple line library(tidyverse).

Spring Into Bioinformatics

Enrichment Analysis

Steve Pederson

02/10/2019

Introduction

GO Enrichment

Task

Finding the genes for each term.

KEGG Enrichment

Task

Conclusion

The tidyverse