library(GO.db)
library(graph)
library(dnet)
library(magrittr)
library(tidyverse)
library(pander)
library(scales)
library(plotly)
library(limma)
The aim of this page is to keep an up-to-date summary of the Gene Ontology database, as represented in the R package GO.db
. The information contained for each GO term is:
The main object created is available for download here and this will load as a tibble
after using the function read_rds()
(from the package readr
). Alternatively, you can load this directly into your workspace using:
goSummaries <- url("https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS") %>%
readRDS()
The package AnnotationDbi
contains the function makeGOGraph
which returns a graphNEL
graph. In the following, we will:
"all"
as this is essentially redundantdnet
. Whilst hosted on CRAN, this package Depends on the bioconductor package supraHex
and should be installed using BiocManager::install("dnet")
Note that the following code can take several minutes to run as these are large graphs.
graphs <- c(BP = "bp", CC = "cc", MF = "mf") %>%
lapply(makeGOGraph) %>%
lapply(function(x){removeNode("all", x)}) %>%
lapply(dDAGreverse)
Ontology | Nodes | Edges |
---|---|---|
BP | 29,211 | 70,659 |
CC | 4,184 | 6,904 |
MF | 11,113 | 13,561 |
goSummaries <- lapply(graphs, function(x){
lng <- dDAGlevel(x, "longest_path") - 1
shrt <- dDAGlevel(x, "shortest_path") - 1
tips <- dDAGtip(x)
tibble(
id = unique(c(names(lng), names(shrt))),
shortest_path = shrt,
longest_path = lng,
terminal_node = id %in% tips
)
}) %>%
bind_rows() %>%
mutate(ontology = Ontology(id))
In reality, we can just add this table to our GO analysis table from tools like goana()
and use it to filter results before adjusting p-values.
As an example of how to use this to assist our decision making, if we chose to remove GO terms with a shortest path \(\leq\) 4, we can see how many terms we would keep and retain.
ggplotly(goSummaries %>%
mutate(keep = shortest_path > 4) %>%
ggplot(aes(keep, fill = terminal_node)) +
geom_bar() +
facet_wrap(~ontology, nrow = 1) +
scale_y_continuous(labels = comma) +
labs(x = "Term Retained", y = "Total") +
theme_bw())
goSummaries %>%
mutate(keep = shortest_path > 4) %>%
group_by(ontology, terminal_node, keep) %>%
tally() %>%
spread(key = keep, value = n) %>%
rename(Discard = `FALSE`,
Retain = `TRUE`) %>%
bind_rows(
tibble(ontology = "**Total**",
Discard = sum(.$Discard),
Retain = sum(.$Retain))) %>%
pander(big.mark = ",",
justify = "llrr")
ontology | terminal_node | Discard | Retain |
---|---|---|---|
BP | FALSE | 8,003 | 9,122 |
BP | TRUE | 4,237 | 7,849 |
CC | FALSE | 1,078 | 323 |
CC | TRUE | 2,158 | 625 |
MF | FALSE | 935 | 1,079 |
MF | TRUE | 2,134 | 6,965 |
Total | NA | 18,545 | 25,963 |
Alternatively, we could remove GO terms with a longest path back to the root node is \(\leq 5\) steps.
ggplotly(goSummaries %>%
mutate(keep = longest_path > 5) %>%
ggplot(aes(keep, fill = terminal_node)) +
geom_bar() +
facet_wrap(~ontology, nrow = 1) +
scale_y_continuous(labels = comma) +
labs(x = "Term Retained", y = "Total") +
theme_bw())
goSummaries %>%
mutate(keep = longest_path > 5) %>%
group_by(ontology, terminal_node, keep) %>%
tally() %>%
spread(key = keep, value = n) %>%
rename(Discard = `FALSE`,
Retain = `TRUE`) %>%
bind_rows(
tibble(ontology = "**Total**",
Discard = sum(.$Discard),
Retain = sum(.$Retain))) %>%
pander(big.mark = ",",
justify = "llrr")
ontology | terminal_node | Discard | Retain |
---|---|---|---|
BP | FALSE | 3,109 | 14,016 |
BP | TRUE | 808 | 11,278 |
CC | FALSE | 562 | 839 |
CC | TRUE | 1,060 | 1,723 |
MF | FALSE | 1,209 | 805 |
MF | TRUE | 5,844 | 3,255 |
Total | NA | 12,592 | 31,916 |
The summaries obtained above can be downloaded from here. Place this in the appropriate folder and the object can then be imported using read_rds("path/to/goSummaries.RDS")
goana
As an artificial example, let’s use a subset of the genes associated with GO:0005795
(Golgi Stack) as a pretend set of DE genes.
library(org.Hs.eg.db)
set.seed(65)
myDE <- get("GO:0005795", org.Hs.egGO2ALLEGS) %>% sample(50)
goResults <- goana(myDE, species = "Hs")
If we leave these as is, there will be a few high-level GO terms which are relatively meaningless, as well as GO terms not present in our DE genes. These can easily be seen at the top of the Ancestor Chart for this term
goResults %>%
rownames_to_column("id") %>%
as_tibble() %>%
arrange(P.DE) %>%
mutate(adjP = p.adjust(P.DE, "bonferroni"),
FDR = p.adjust(P.DE, "fdr")) %>%
slice(1:10) %>%
pander(caption = "Unfiltered results for GO analysis",
justify = "lllrrrrr",
split.tables = Inf)
id | Term | Ont | N | DE | P.DE | adjP | FDR |
---|---|---|---|---|---|---|---|
GO:0005795 | Golgi stack | CC | 149 | 50 | 1.153e-111 | 2.622e-107 | 2.622e-107 |
GO:0098791 | Golgi apparatus subcompartment | CC | 359 | 50 | 4.473e-90 | 1.017e-85 | 5.087e-86 |
GO:0031984 | organelle subcompartment | CC | 380 | 50 | 9.442e-89 | 2.148e-84 | 7.159e-85 |
GO:0031985 | Golgi cisterna | CC | 116 | 36 | 2.656e-72 | 6.041e-68 | 1.51e-68 |
GO:0032580 | Golgi cisterna membrane | CC | 92 | 29 | 3.747e-57 | 8.522e-53 | 1.704e-53 |
GO:0005794 | Golgi apparatus | CC | 1560 | 50 | 5.788e-57 | 1.317e-52 | 2.194e-53 |
GO:0000139 | Golgi membrane | CC | 754 | 41 | 1.001e-50 | 2.278e-46 | 3.254e-47 |
GO:0098588 | bounding membrane of organelle | CC | 2064 | 43 | 4.534e-36 | 1.031e-31 | 1.289e-32 |
GO:0012505 | endomembrane system | CC | 4441 | 50 | 5.068e-34 | 1.153e-29 | 1.281e-30 |
GO:0031090 | organelle membrane | CC | 3485 | 45 | 1.638e-29 | 3.726e-25 | 3.726e-26 |
I usually like to rearrange mine as a tibble
and do some filtering as well. In the following we’ll:
filteredGO <- goResults %>%
rownames_to_column("id") %>%
as_tibble() %>%
arrange(P.DE) %>%
left_join(goSummaries) %>%
filter(DE > 0) %>%
filter(shortest_path > 4) %>%
mutate(adjP = p.adjust(P.DE, "bonferroni"),
FDR = p.adjust(P.DE, "fdr"))
Now we have removed some of the redundant terms, we can just look at the top 10.
id | Term | Ont | N | DE | P.DE | adjP | FDR |
---|---|---|---|---|---|---|---|
GO:0032580 | Golgi cisterna membrane | CC | 92 | 29 | 3.747e-57 | 2.51e-54 | 2.51e-54 |
GO:0000137 | Golgi cis cisterna | CC | 28 | 11 | 1.167e-22 | 7.821e-20 | 3.91e-20 |
GO:0007030 | Golgi organization | BP | 142 | 12 | 7.422e-16 | 4.973e-13 | 1.658e-13 |
GO:0009101 | glycoprotein biosynthetic process | BP | 348 | 14 | 6.924e-14 | 4.639e-11 | 1.16e-11 |
GO:0006024 | glycosaminoglycan biosynthetic process | BP | 110 | 7 | 8.756e-09 | 5.866e-06 | 1.173e-06 |
GO:0006023 | aminoglycan biosynthetic process | BP | 115 | 7 | 1.194e-08 | 8.003e-06 | 1.334e-06 |
GO:0006029 | proteoglycan metabolic process | BP | 94 | 6 | 1.072e-07 | 7.181e-05 | 9.799e-06 |
GO:0030203 | glycosaminoglycan metabolic process | BP | 160 | 7 | 1.17e-07 | 7.839e-05 | 9.799e-06 |
GO:0003831 | beta-N-acetylglucosaminylglycopeptide beta-1,4-galactosyltransferase activity | MF | 5 | 3 | 1.363e-07 | 9.132e-05 | 1.015e-05 |
GO:1903510 | mucopolysaccharide metabolic process | BP | 112 | 6 | 3.046e-07 | 0.0002041 | 2.041e-05 |
Or we could find which terminal nodes are significant, as these may be especially informative.
filteredGO %>%
filter(adjP < 0.05, terminal_node) %>%
dplyr::select(id, Term, Ont, N, DE, P.DE, adjP, FDR) %>%
pander(caption = "Significant GO terms with no Child Terms",
justify = "lllrrrrr",
split.tables = Inf)
id | Term | Ont | N | DE | P.DE | adjP | FDR |
---|---|---|---|---|---|---|---|
GO:0003831 | beta-N-acetylglucosaminylglycopeptide beta-1,4-galactosyltransferase activity | MF | 5 | 3 | 1.363e-07 | 9.132e-05 | 1.015e-05 |
GO:0030206 | chondroitin sulfate biosynthetic process | BP | 26 | 3 | 3.418e-05 | 0.0229 | 0.001231 |
GO:0047238 | glucuronosyl-N-acetylgalactosaminyl-proteoglycan 4-beta-N-acetylgalactosaminyltransferase activity | MF | 4 | 2 | 3.491e-05 | 0.02339 | 0.001231 |
GO:0050510 | N-acetylgalactosaminyl-proteoglycan 3-beta-glucuronosyltransferase activity | MF | 4 | 2 | 3.491e-05 | 0.02339 | 0.001231 |
GO:0018146 | keratan sulfate biosynthetic process | BP | 28 | 3 | 4.292e-05 | 0.02875 | 0.001369 |
GO:0003945 | N-acetyllactosamine synthase activity | MF | 5 | 2 | 5.81e-05 | 0.03892 | 0.001707 |
Note that we just randomly grabbed 50 genes from GO:0005795
, so this may also highlight issues with the general GO analytic approach and being cautious about low-level terms with only a small number of genes.
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: org.Hs.eg.db(v.3.11.4), limma(v.3.44.3), plotly(v.4.9.2.1), scales(v.1.1.1), pander(v.0.6.3), forcats(v.0.5.0), stringr(v.1.4.0), dplyr(v.1.0.2), purrr(v.0.3.4), readr(v.1.3.1), tidyr(v.1.1.2), tibble(v.3.0.3), ggplot2(v.3.3.2), tidyverse(v.1.3.0), magrittr(v.1.5), dnet(v.1.1.7), supraHex(v.1.26.0), hexbin(v.1.28.1), igraph(v.1.2.5), graph(v.1.66.0), GO.db(v.3.11.4), AnnotationDbi(v.1.50.3), IRanges(v.2.22.2), S4Vectors(v.0.26.1), Biobase(v.2.48.0) and BiocGenerics(v.0.34.0)
loaded via a namespace (and not attached): nlme(v.3.1-149), fs(v.1.5.0), lubridate(v.1.7.9), bit64(v.4.0.5), httr(v.1.4.2), Rgraphviz(v.2.32.0), tools(v.4.0.2), backports(v.1.1.9), R6(v.2.4.1), DBI(v.1.1.0), lazyeval(v.0.2.2), colorspace(v.1.4-1), withr(v.2.2.0), tidyselect(v.1.1.0), bit(v.4.0.4), compiler(v.4.0.2), cli(v.2.0.2), rvest(v.0.3.6), Cairo(v.1.5-12.2), xml2(v.1.3.2), labeling(v.0.3), digest(v.0.6.25), rmarkdown(v.2.3), pkgconfig(v.2.0.3), htmltools(v.0.5.0), highr(v.0.8), dbplyr(v.1.4.4), htmlwidgets(v.1.5.1), rlang(v.0.4.7), readxl(v.1.3.1), rstudioapi(v.0.11), RSQLite(v.2.2.0), generics(v.0.0.2), farver(v.2.0.3), jsonlite(v.1.7.1), crosstalk(v.1.1.0.1), Matrix(v.1.2-18), Rcpp(v.1.0.5), munsell(v.0.5.0), fansi(v.0.4.1), ape(v.5.4-1), lifecycle(v.0.2.0), stringi(v.1.5.3), yaml(v.2.2.1), MASS(v.7.3-52), grid(v.4.0.2), blob(v.1.2.1), crayon(v.1.3.4), lattice(v.0.20-41), haven(v.2.3.1), hms(v.0.5.3), knitr(v.1.29), pillar(v.1.4.6), codetools(v.0.2-16), reprex(v.0.3.0), glue(v.1.4.2), evaluate(v.0.14), data.table(v.1.13.0), modelr(v.0.1.8), vctrs(v.0.3.4), cellranger(v.1.1.0), gtable(v.0.3.0), assertthat(v.0.2.1), xfun(v.0.17), broom(v.0.7.0), viridisLite(v.0.3.0), memoise(v.1.1.0) and ellipsis(v.0.3.1)