Make GO Summaries

Introduction

library(GO.db)
library(graph)
library(dnet)
library(magrittr)
library(tidyverse)
library(pander)
library(scales)
library(plotly)
library(limma)

The aim of this page is to keep an up-to-date summary of the Gene Ontology database, as represented in the R package GO.db. The information contained for each GO term is:

Which Ontology the GO term belongs to
The shortest path back to the root node of the relevant ontology
The longest path back to the root node of the relevant ontology
Whether the GO term is a terminal node (i.e. it has no children)

The main object created is available for download here and this will load as a tibble after using the function read_rds() (from the package readr). Alternatively, you can load this directly into your workspace using:

goSummaries <- url("https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS") %>%
  readRDS()

Create the Basic Graphs

The package AnnotationDbi contains the function makeGOGraph which returns a graphNEL graph. In the following, we will:

Create a graph for each ontology and remove the node "all" as this is essentially redundant
Reverse the direction of the DAG for compatibility with tools in the package dnet. Whilst hosted on CRAN, this package Depends on the bioconductor package supraHex and should be installed using BiocManager::install("dnet")

Note that the following code can take several minutes to run as these are large graphs.

graphs <- c(BP = "bp", CC = "cc", MF = "mf") %>%
    lapply(makeGOGraph) %>%
    lapply(function(x){removeNode("all", x)}) %>%
    lapply(dDAGreverse)

Summary of graph sizes for each ontology
Ontology	Nodes	Edges
BP	29,211	70,659
CC	4,184	6,904
MF	11,113	13,561

Find the Key Information

goSummaries <- lapply(graphs, function(x){
    lng <- dDAGlevel(x, "longest_path") - 1
    shrt <- dDAGlevel(x, "shortest_path") - 1
    tips <- dDAGtip(x)
    tibble(
        id = unique(c(names(lng), names(shrt))),
        shortest_path = shrt,
        longest_path = lng,
        terminal_node = id %in% tips
        )
}) %>%
    bind_rows() %>%
    mutate(ontology = Ontology(id))

Path lengths for each ontology, based on whether a term is a terminal node or not.

$Cumulative number of GO Terms with paths $\geq$ x.$

Cumulative number of GO Terms with paths $\geq$ x.

Examples

In reality, we can just add this table to our GO analysis table from tools like goana() and use it to filter results before adjusting p-values.

As an example of how to use this to assist our decision making, if we chose to remove GO terms with a shortest path $\leq$ 4, we can see how many terms we would keep and retain.

ggplotly(goSummaries %>%
    mutate(keep = shortest_path > 4) %>%
    ggplot(aes(keep, fill = terminal_node)) +
    geom_bar() +
    facet_wrap(~ontology, nrow = 1) +
    scale_y_continuous(labels = comma) +
    labs(x = "Term Retained", y = "Total") +
    theme_bw())

goSummaries %>%
    mutate(keep = shortest_path > 4) %>%
    group_by(ontology, terminal_node, keep) %>%
    tally() %>%
    spread(key = keep, value = n) %>%
    rename(Discard = `FALSE`,
           Retain = `TRUE`) %>%
    bind_rows(
        tibble(ontology = "**Total**",
               Discard = sum(.$Discard),
               Retain = sum(.$Retain))) %>%
    pander(big.mark = ",",
           justify = "llrr")

ontology	terminal_node	Discard	Retain
BP	FALSE	8,003	9,122
BP	TRUE	4,237	7,849
CC	FALSE	1,078	323
CC	TRUE	2,158	625
MF	FALSE	935	1,079
MF	TRUE	2,134	6,965
Total	NA	18,545	25,963

Alternatively, we could remove GO terms with a longest path back to the root node is $\leq 5$ steps.

ggplotly(goSummaries %>%
    mutate(keep = longest_path > 5) %>%
    ggplot(aes(keep, fill = terminal_node)) +
    geom_bar() +
    facet_wrap(~ontology, nrow = 1) +
    scale_y_continuous(labels = comma) +
    labs(x = "Term Retained", y = "Total") +
    theme_bw())

goSummaries %>%
    mutate(keep = longest_path > 5) %>%
    group_by(ontology, terminal_node, keep) %>%
    tally() %>%
    spread(key = keep, value = n) %>%
    rename(Discard = `FALSE`,
           Retain = `TRUE`) %>%
    bind_rows(
        tibble(ontology = "**Total**",
               Discard = sum(.$Discard),
               Retain = sum(.$Retain))) %>%
    pander(big.mark = ",",
           justify = "llrr")

ontology	terminal_node	Discard	Retain
BP	FALSE	3,109	14,016
BP	TRUE	808	11,278
CC	FALSE	562	839
CC	TRUE	1,060	1,723
MF	FALSE	1,209	805
MF	TRUE	5,844	3,255
Total	NA	12,592	31,916

The summaries obtained above can be downloaded from here. Place this in the appropriate folder and the object can then be imported using read_rds("path/to/goSummaries.RDS")

Use with `goana`

As an artificial example, let’s use a subset of the genes associated with GO:0005795 (Golgi Stack) as a pretend set of DE genes.

library(org.Hs.eg.db)
set.seed(65)
myDE <- get("GO:0005795", org.Hs.egGO2ALLEGS) %>% sample(50)
goResults <- goana(myDE, species = "Hs")

If we leave these as is, there will be a few high-level GO terms which are relatively meaningless, as well as GO terms not present in our DE genes. These can easily be seen at the top of the Ancestor Chart for this term

goResults %>%
    rownames_to_column("id") %>%
    as_tibble() %>%
    arrange(P.DE) %>% 
    mutate(adjP = p.adjust(P.DE, "bonferroni"),
           FDR = p.adjust(P.DE, "fdr")) %>%
    slice(1:10) %>%
    pander(caption = "Unfiltered results for GO analysis",
           justify = "lllrrrrr",
           split.tables = Inf)

Unfiltered results for GO analysis
id	Term	Ont	N	DE	P.DE	adjP	FDR
GO:0005795	Golgi stack	CC	149	50	1.153e-111	2.622e-107	2.622e-107
GO:0098791	Golgi apparatus subcompartment	CC	359	50	4.473e-90	1.017e-85	5.087e-86
GO:0031984	organelle subcompartment	CC	380	50	9.442e-89	2.148e-84	7.159e-85
GO:0031985	Golgi cisterna	CC	116	36	2.656e-72	6.041e-68	1.51e-68
GO:0032580	Golgi cisterna membrane	CC	92	29	3.747e-57	8.522e-53	1.704e-53
GO:0005794	Golgi apparatus	CC	1560	50	5.788e-57	1.317e-52	2.194e-53
GO:0000139	Golgi membrane	CC	754	41	1.001e-50	2.278e-46	3.254e-47
GO:0098588	bounding membrane of organelle	CC	2064	43	4.534e-36	1.031e-31	1.289e-32
GO:0012505	endomembrane system	CC	4441	50	5.068e-34	1.153e-29	1.281e-30
GO:0031090	organelle membrane	CC	3485	45	1.638e-29	3.726e-25	3.726e-26

I usually like to rearrange mine as a tibble and do some filtering as well. In the following we’ll:

Arrange by P value
Add the information from goSummaries
Remove GO terms not represented in our DE genes
Restrict our results to a specific subset based on distance from the root node
Adjust p-values

filteredGO <- goResults %>%
    rownames_to_column("id") %>%
    as_tibble() %>%
    arrange(P.DE) %>%
    left_join(goSummaries) %>%
    filter(DE > 0) %>%
    filter(shortest_path > 4) %>%
    mutate(adjP = p.adjust(P.DE, "bonferroni"),
           FDR = p.adjust(P.DE, "fdr"))

Now we have removed some of the redundant terms, we can just look at the top 10.

Top 10 terms for GO analysis after filtering
id	Term	Ont	N	DE	P.DE	adjP	FDR
GO:0032580	Golgi cisterna membrane	CC	92	29	3.747e-57	2.51e-54	2.51e-54
GO:0000137	Golgi cis cisterna	CC	28	11	1.167e-22	7.821e-20	3.91e-20
GO:0007030	Golgi organization	BP	142	12	7.422e-16	4.973e-13	1.658e-13
GO:0009101	glycoprotein biosynthetic process	BP	348	14	6.924e-14	4.639e-11	1.16e-11
GO:0006024	glycosaminoglycan biosynthetic process	BP	110	7	8.756e-09	5.866e-06	1.173e-06
GO:0006023	aminoglycan biosynthetic process	BP	115	7	1.194e-08	8.003e-06	1.334e-06
GO:0006029	proteoglycan metabolic process	BP	94	6	1.072e-07	7.181e-05	9.799e-06
GO:0030203	glycosaminoglycan metabolic process	BP	160	7	1.17e-07	7.839e-05	9.799e-06
GO:0003831	beta-N-acetylglucosaminylglycopeptide beta-1,4-galactosyltransferase activity	MF	5	3	1.363e-07	9.132e-05	1.015e-05
GO:1903510	mucopolysaccharide metabolic process	BP	112	6	3.046e-07	0.0002041	2.041e-05

Or we could find which terminal nodes are significant, as these may be especially informative.

filteredGO %>%
    filter(adjP < 0.05, terminal_node) %>%
    dplyr::select(id, Term, Ont, N, DE, P.DE, adjP, FDR) %>%
    pander(caption = "Significant GO terms with no Child Terms",
           justify = "lllrrrrr",
           split.tables = Inf)

Significant GO terms with no Child Terms
id	Term	Ont	N	DE	P.DE	adjP	FDR
GO:0003831	beta-N-acetylglucosaminylglycopeptide beta-1,4-galactosyltransferase activity	MF	5	3	1.363e-07	9.132e-05	1.015e-05
GO:0030206	chondroitin sulfate biosynthetic process	BP	26	3	3.418e-05	0.0229	0.001231
GO:0047238	glucuronosyl-N-acetylgalactosaminyl-proteoglycan 4-beta-N-acetylgalactosaminyltransferase activity	MF	4	2	3.491e-05	0.02339	0.001231
GO:0050510	N-acetylgalactosaminyl-proteoglycan 3-beta-glucuronosyltransferase activity	MF	4	2	3.491e-05	0.02339	0.001231
GO:0018146	keratan sulfate biosynthetic process	BP	28	3	4.292e-05	0.02875	0.001369
GO:0003945	N-acetyllactosamine synthase activity	MF	5	2	5.81e-05	0.03892	0.001707

Note that we just randomly grabbed 50 genes from GO:0005795, so this may also highlight issues with the general GO analytic approach and being cautious about low-level terms with only a small number of genes.

Session Info

R version 4.0.2 (2020-06-22)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: org.Hs.eg.db(v.3.11.4), limma(v.3.44.3), plotly(v.4.9.2.1), scales(v.1.1.1), pander(v.0.6.3), forcats(v.0.5.0), stringr(v.1.4.0), dplyr(v.1.0.2), purrr(v.0.3.4), readr(v.1.3.1), tidyr(v.1.1.2), tibble(v.3.0.3), ggplot2(v.3.3.2), tidyverse(v.1.3.0), magrittr(v.1.5), dnet(v.1.1.7), supraHex(v.1.26.0), hexbin(v.1.28.1), igraph(v.1.2.5), graph(v.1.66.0), GO.db(v.3.11.4), AnnotationDbi(v.1.50.3), IRanges(v.2.22.2), S4Vectors(v.0.26.1), Biobase(v.2.48.0) and BiocGenerics(v.0.34.0)

loaded via a namespace (and not attached): nlme(v.3.1-149), fs(v.1.5.0), lubridate(v.1.7.9), bit64(v.4.0.5), httr(v.1.4.2), Rgraphviz(v.2.32.0), tools(v.4.0.2), backports(v.1.1.9), R6(v.2.4.1), DBI(v.1.1.0), lazyeval(v.0.2.2), colorspace(v.1.4-1), withr(v.2.2.0), tidyselect(v.1.1.0), bit(v.4.0.4), compiler(v.4.0.2), cli(v.2.0.2), rvest(v.0.3.6), Cairo(v.1.5-12.2), xml2(v.1.3.2), labeling(v.0.3), digest(v.0.6.25), rmarkdown(v.2.3), pkgconfig(v.2.0.3), htmltools(v.0.5.0), highr(v.0.8), dbplyr(v.1.4.4), htmlwidgets(v.1.5.1), rlang(v.0.4.7), readxl(v.1.3.1), rstudioapi(v.0.11), RSQLite(v.2.2.0), generics(v.0.0.2), farver(v.2.0.3), jsonlite(v.1.7.1), crosstalk(v.1.1.0.1), Matrix(v.1.2-18), Rcpp(v.1.0.5), munsell(v.0.5.0), fansi(v.0.4.1), ape(v.5.4-1), lifecycle(v.0.2.0), stringi(v.1.5.3), yaml(v.2.2.1), MASS(v.7.3-52), grid(v.4.0.2), blob(v.1.2.1), crayon(v.1.3.4), lattice(v.0.20-41), haven(v.2.3.1), hms(v.0.5.3), knitr(v.1.29), pillar(v.1.4.6), codetools(v.0.2-16), reprex(v.0.3.0), glue(v.1.4.2), evaluate(v.0.14), data.table(v.1.13.0), modelr(v.0.1.8), vctrs(v.0.3.4), cellranger(v.1.1.0), gtable(v.0.3.0), assertthat(v.0.2.1), xfun(v.0.17), broom(v.0.7.0), viridisLite(v.0.3.0), memoise(v.1.1.0) and ellipsis(v.0.3.1)

Make GO Summaries

Steve Pederson

15 September, 2020

Introduction

Create the Basic Graphs

Find the Key Information

Examples

Use with `goana`

Session Info

Make GO Summaries

Steve Pederson

15 September, 2020

Introduction

Create the Basic Graphs

Find the Key Information

Examples

Use with goana

Session Info

Use with `goana`