Introduction

library(GO.db)
library(graph)
library(dnet)
library(magrittr)
library(tidyverse)
library(pander)
library(scales)
library(plotly)
library(limma)

The aim of this page is to keep an up-to-date summary of the Gene Ontology database, as represented in the R package GO.db. The information contained for each GO term is:

The main object created is available for download here and this will load as a tibble after using the function read_rds() (from the package readr). Alternatively, you can load this directly into your workspace using:

goSummaries <- url("https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS") %>%
  readRDS()

Create the Basic Graphs

The package AnnotationDbi contains the function makeGOGraph which returns a graphNEL graph. In the following, we will:

  1. Create a graph for each ontology and remove the node "all" as this is essentially redundant
  2. Reverse the direction of the DAG for compatibility with tools in the package dnet. Whilst hosted on CRAN, this package Depends on the bioconductor package supraHex and should be installed using BiocManager::install("dnet")

Note that the following code can take several minutes to run as these are large graphs.

graphs <- c(BP = "bp", CC = "cc", MF = "mf") %>%
    lapply(makeGOGraph) %>%
    lapply(function(x){removeNode("all", x)}) %>%
    lapply(dDAGreverse)
Summary of graph sizes for each ontology
Ontology Nodes Edges
BP 29,211 70,659
CC 4,184 6,904
MF 11,113 13,561

Find the Key Information

goSummaries <- lapply(graphs, function(x){
    lng <- dDAGlevel(x, "longest_path") - 1
    shrt <- dDAGlevel(x, "shortest_path") - 1
    tips <- dDAGtip(x)
    tibble(
        id = unique(c(names(lng), names(shrt))),
        shortest_path = shrt,
        longest_path = lng,
        terminal_node = id %in% tips
        )
}) %>%
    bind_rows() %>%
    mutate(ontology = Ontology(id))
Path lengths for each ontology, based on whether a term is a terminal node or not.

Path lengths for each ontology, based on whether a term is a terminal node or not.

Cumulative number of GO Terms with paths $\geq$ x.

Cumulative number of GO Terms with paths \(\geq\) x.

Examples

In reality, we can just add this table to our GO analysis table from tools like goana() and use it to filter results before adjusting p-values.

As an example of how to use this to assist our decision making, if we chose to remove GO terms with a shortest path \(\leq\) 4, we can see how many terms we would keep and retain.

ggplotly(goSummaries %>%
    mutate(keep = shortest_path > 4) %>%
    ggplot(aes(keep, fill = terminal_node)) +
    geom_bar() +
    facet_wrap(~ontology, nrow = 1) +
    scale_y_continuous(labels = comma) +
    labs(x = "Term Retained", y = "Total") +
    theme_bw())
goSummaries %>%
    mutate(keep = shortest_path > 4) %>%
    group_by(ontology, terminal_node, keep) %>%
    tally() %>%
    spread(key = keep, value = n) %>%
    rename(Discard = `FALSE`,
           Retain = `TRUE`) %>%
    bind_rows(
        tibble(ontology = "**Total**",
               Discard = sum(.$Discard),
               Retain = sum(.$Retain))) %>%
    pander(big.mark = ",",
           justify = "llrr")
ontology terminal_node Discard Retain
BP FALSE 8,003 9,122
BP TRUE 4,237 7,849
CC FALSE 1,078 323
CC TRUE 2,158 625
MF FALSE 935 1,079
MF TRUE 2,134 6,965
Total NA 18,545 25,963

Alternatively, we could remove GO terms with a longest path back to the root node is \(\leq 5\) steps.

ggplotly(goSummaries %>%
    mutate(keep = longest_path > 5) %>%
    ggplot(aes(keep, fill = terminal_node)) +
    geom_bar() +
    facet_wrap(~ontology, nrow = 1) +
    scale_y_continuous(labels = comma) +
    labs(x = "Term Retained", y = "Total") +
    theme_bw())
goSummaries %>%
    mutate(keep = longest_path > 5) %>%
    group_by(ontology, terminal_node, keep) %>%
    tally() %>%
    spread(key = keep, value = n) %>%
    rename(Discard = `FALSE`,
           Retain = `TRUE`) %>%
    bind_rows(
        tibble(ontology = "**Total**",
               Discard = sum(.$Discard),
               Retain = sum(.$Retain))) %>%
    pander(big.mark = ",",
           justify = "llrr")
ontology terminal_node Discard Retain
BP FALSE 3,109 14,016
BP TRUE 808 11,278
CC FALSE 562 839
CC TRUE 1,060 1,723
MF FALSE 1,209 805
MF TRUE 5,844 3,255
Total NA 12,592 31,916

The summaries obtained above can be downloaded from here. Place this in the appropriate folder and the object can then be imported using read_rds("path/to/goSummaries.RDS")

Use with goana

As an artificial example, let’s use a subset of the genes associated with GO:0005795 (Golgi Stack) as a pretend set of DE genes.

library(org.Hs.eg.db)
set.seed(65)
myDE <- get("GO:0005795", org.Hs.egGO2ALLEGS) %>% sample(50)
goResults <- goana(myDE, species = "Hs")

If we leave these as is, there will be a few high-level GO terms which are relatively meaningless, as well as GO terms not present in our DE genes. These can easily be seen at the top of the Ancestor Chart for this term

goResults %>%
    rownames_to_column("id") %>%
    as_tibble() %>%
    arrange(P.DE) %>% 
    mutate(adjP = p.adjust(P.DE, "bonferroni"),
           FDR = p.adjust(P.DE, "fdr")) %>%
    slice(1:10) %>%
    pander(caption = "Unfiltered results for GO analysis",
           justify = "lllrrrrr",
           split.tables = Inf)
Unfiltered results for GO analysis
id Term Ont N DE P.DE adjP FDR
GO:0005795 Golgi stack CC 149 50 1.153e-111 2.622e-107 2.622e-107
GO:0098791 Golgi apparatus subcompartment CC 359 50 4.473e-90 1.017e-85 5.087e-86
GO:0031984 organelle subcompartment CC 380 50 9.442e-89 2.148e-84 7.159e-85
GO:0031985 Golgi cisterna CC 116 36 2.656e-72 6.041e-68 1.51e-68
GO:0032580 Golgi cisterna membrane CC 92 29 3.747e-57 8.522e-53 1.704e-53
GO:0005794 Golgi apparatus CC 1560 50 5.788e-57 1.317e-52 2.194e-53
GO:0000139 Golgi membrane CC 754 41 1.001e-50 2.278e-46 3.254e-47
GO:0098588 bounding membrane of organelle CC 2064 43 4.534e-36 1.031e-31 1.289e-32
GO:0012505 endomembrane system CC 4441 50 5.068e-34 1.153e-29 1.281e-30
GO:0031090 organelle membrane CC 3485 45 1.638e-29 3.726e-25 3.726e-26

I usually like to rearrange mine as a tibble and do some filtering as well. In the following we’ll:

filteredGO <- goResults %>%
    rownames_to_column("id") %>%
    as_tibble() %>%
    arrange(P.DE) %>%
    left_join(goSummaries) %>%
    filter(DE > 0) %>%
    filter(shortest_path > 4) %>%
    mutate(adjP = p.adjust(P.DE, "bonferroni"),
           FDR = p.adjust(P.DE, "fdr"))

Now we have removed some of the redundant terms, we can just look at the top 10.

Top 10 terms for GO analysis after filtering
id Term Ont N DE P.DE adjP FDR
GO:0032580 Golgi cisterna membrane CC 92 29 3.747e-57 2.51e-54 2.51e-54
GO:0000137 Golgi cis cisterna CC 28 11 1.167e-22 7.821e-20 3.91e-20
GO:0007030 Golgi organization BP 142 12 7.422e-16 4.973e-13 1.658e-13
GO:0009101 glycoprotein biosynthetic process BP 348 14 6.924e-14 4.639e-11 1.16e-11
GO:0006024 glycosaminoglycan biosynthetic process BP 110 7 8.756e-09 5.866e-06 1.173e-06
GO:0006023 aminoglycan biosynthetic process BP 115 7 1.194e-08 8.003e-06 1.334e-06
GO:0006029 proteoglycan metabolic process BP 94 6 1.072e-07 7.181e-05 9.799e-06
GO:0030203 glycosaminoglycan metabolic process BP 160 7 1.17e-07 7.839e-05 9.799e-06
GO:0003831 beta-N-acetylglucosaminylglycopeptide beta-1,4-galactosyltransferase activity MF 5 3 1.363e-07 9.132e-05 1.015e-05
GO:1903510 mucopolysaccharide metabolic process BP 112 6 3.046e-07 0.0002041 2.041e-05

Or we could find which terminal nodes are significant, as these may be especially informative.

filteredGO %>%
    filter(adjP < 0.05, terminal_node) %>%
    dplyr::select(id, Term, Ont, N, DE, P.DE, adjP, FDR) %>%
    pander(caption = "Significant GO terms with no Child Terms",
           justify = "lllrrrrr",
           split.tables = Inf)
Significant GO terms with no Child Terms
id Term Ont N DE P.DE adjP FDR
GO:0003831 beta-N-acetylglucosaminylglycopeptide beta-1,4-galactosyltransferase activity MF 5 3 1.363e-07 9.132e-05 1.015e-05
GO:0030206 chondroitin sulfate biosynthetic process BP 26 3 3.418e-05 0.0229 0.001231
GO:0047238 glucuronosyl-N-acetylgalactosaminyl-proteoglycan 4-beta-N-acetylgalactosaminyltransferase activity MF 4 2 3.491e-05 0.02339 0.001231
GO:0050510 N-acetylgalactosaminyl-proteoglycan 3-beta-glucuronosyltransferase activity MF 4 2 3.491e-05 0.02339 0.001231
GO:0018146 keratan sulfate biosynthetic process BP 28 3 4.292e-05 0.02875 0.001369
GO:0003945 N-acetyllactosamine synthase activity MF 5 2 5.81e-05 0.03892 0.001707

Note that we just randomly grabbed 50 genes from GO:0005795, so this may also highlight issues with the general GO analytic approach and being cautious about low-level terms with only a small number of genes.

Session Info

R version 4.0.2 (2020-06-22)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: org.Hs.eg.db(v.3.11.4), limma(v.3.44.3), plotly(v.4.9.2.1), scales(v.1.1.1), pander(v.0.6.3), forcats(v.0.5.0), stringr(v.1.4.0), dplyr(v.1.0.2), purrr(v.0.3.4), readr(v.1.3.1), tidyr(v.1.1.2), tibble(v.3.0.3), ggplot2(v.3.3.2), tidyverse(v.1.3.0), magrittr(v.1.5), dnet(v.1.1.7), supraHex(v.1.26.0), hexbin(v.1.28.1), igraph(v.1.2.5), graph(v.1.66.0), GO.db(v.3.11.4), AnnotationDbi(v.1.50.3), IRanges(v.2.22.2), S4Vectors(v.0.26.1), Biobase(v.2.48.0) and BiocGenerics(v.0.34.0)

loaded via a namespace (and not attached): nlme(v.3.1-149), fs(v.1.5.0), lubridate(v.1.7.9), bit64(v.4.0.5), httr(v.1.4.2), Rgraphviz(v.2.32.0), tools(v.4.0.2), backports(v.1.1.9), R6(v.2.4.1), DBI(v.1.1.0), lazyeval(v.0.2.2), colorspace(v.1.4-1), withr(v.2.2.0), tidyselect(v.1.1.0), bit(v.4.0.4), compiler(v.4.0.2), cli(v.2.0.2), rvest(v.0.3.6), Cairo(v.1.5-12.2), xml2(v.1.3.2), labeling(v.0.3), digest(v.0.6.25), rmarkdown(v.2.3), pkgconfig(v.2.0.3), htmltools(v.0.5.0), highr(v.0.8), dbplyr(v.1.4.4), htmlwidgets(v.1.5.1), rlang(v.0.4.7), readxl(v.1.3.1), rstudioapi(v.0.11), RSQLite(v.2.2.0), generics(v.0.0.2), farver(v.2.0.3), jsonlite(v.1.7.1), crosstalk(v.1.1.0.1), Matrix(v.1.2-18), Rcpp(v.1.0.5), munsell(v.0.5.0), fansi(v.0.4.1), ape(v.5.4-1), lifecycle(v.0.2.0), stringi(v.1.5.3), yaml(v.2.2.1), MASS(v.7.3-52), grid(v.4.0.2), blob(v.1.2.1), crayon(v.1.3.4), lattice(v.0.20-41), haven(v.2.3.1), hms(v.0.5.3), knitr(v.1.29), pillar(v.1.4.6), codetools(v.0.2-16), reprex(v.0.3.0), glue(v.1.4.2), evaluate(v.0.14), data.table(v.1.13.0), modelr(v.0.1.8), vctrs(v.0.3.4), cellranger(v.1.1.0), gtable(v.0.3.0), assertthat(v.0.2.1), xfun(v.0.17), broom(v.0.7.0), viridisLite(v.0.3.0), memoise(v.1.1.0) and ellipsis(v.0.3.1)