Last updated: 2020-04-02

Checks: 7 0

Knit directory: 20170327_Psen2S4Ter_RNASeq/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200119)

The command set.seed(20200119) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: e928a97

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Untracked files:
    Untracked:  basic_sample_checklist.txt
    Untracked:  experiment_paired_fastq_spreadsheet_template.txt

Unstaged changes:
    Modified:   analysis/_site.yml
    Modified:   data/cpmPostNorm.rds
    Modified:   data/dgeList.rds
    Modified:   data/fit.rds
    Modified:   output/Enrichment_Hom_V_Het.csv
    Modified:   output/psen2HomVsHet.csv
    Modified:   output/psen2VsWT.csv

Staged changes:
    Modified:   analysis/_site.yml

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	33f5f93	Steve Ped	2020-03-07	Updated html after minor formatting
Rmd	65ae03b	Steve Ped	2020-03-07	Typo
Rmd	3b120d9	Steve Ped	2020-03-07	Corrected table justifications
html	a5cdf5d	Steve Ped	2020-03-06	Finished Revisions
Rmd	2772d42	Steve Ped	2020-03-06	Corrected UpSet names
Rmd	a9781d9	Steve Ped	2020-03-06	Fixed duplicate labels
Rmd	154085b	Steve Ped	2020-03-06	Reduced ranked lists to fry only
Rmd	2aa6e06	Steve Ped	2020-03-06	Removed gsea, camera & combining data. Now only using fry + goseq
Rmd	433a729	Steve Ped	2020-03-05	Prepared to remove EGSEA-type approach
Rmd	956c065	Steve Ped	2020-03-05	Revised GO exploration
html	1385ceb	Steve Ped	2020-03-05	Compiled after renaming columns
Rmd	2c56cef	Steve Ped	2020-03-05	Tidied column names for enrichment output
html	7fe1505	Steve Ped	2020-03-05	Updated all analyses & heatmaps after changing to cpmPostNorm instead of fit\(fitted.values for enrichment/heatmaps</td> </tr> <tr> <td>Rmd</td> <td><a href="https://github.com/UofABioinformaticsHub/20170327_Psen2S4Ter_RNASeq/blob/bad81c1731ba8bc4487d25ec2719e92ae6917e4a/analysis/3_Enrichment_MutantVsWT.Rmd" target="_blank">bad81c1</a></td> <td>Steve Ped</td> <td>2020-03-05</td> <td>Updated MutVsWt UpSet plots for KEGG/HALLMARK</td> </tr> <tr> <td>Rmd</td> <td><a href="https://github.com/UofABioinformaticsHub/20170327_Psen2S4Ter_RNASeq/blob/377410a12ca13a7e233e39fb2f77f165e846d830/analysis/3_Enrichment_MutantVsWT.Rmd" target="_blank">377410a</a></td> <td>Steve Ped</td> <td>2020-03-05</td> <td>Finished revision of Enrichment Analysis</td> </tr> <tr> <td>Rmd</td> <td><a href="https://github.com/UofABioinformaticsHub/20170327_Psen2S4Ter_RNASeq/blob/c0b89ed8f761b0c7222538c353d21cd5aa402b94/analysis/3_Enrichment_MutantVsWT.Rmd" target="_blank">c0b89ed</a></td> <td>Steve Ped</td> <td>2020-03-05</td> <td>Reran after changing to cpmPotNorm instead of the incorrect fit\)fitted.values
html	7e59d3c	Steve Ped	2020-02-20	Recompiled
Rmd	4ede752	Steve Ped	2020-02-20	Added headings
html	0168824	Steve Ped	2020-02-20	Rebult after resizing
Rmd	de9c4b3	Steve Ped	2020-02-20	Resized heatmaps again
html	86ffa07	Steve Ped	2020-02-20	Updated saample layout figure
Rmd	4d2af23	Steve Ped	2020-02-19	Modified heatmap sizes & initial network layout
Rmd	8fe35b0	Steve Ped	2020-02-19	Added multiple heatmaps & UpSet plots
html	679b10d	Steve Ped	2020-02-19	Rendered OXPHOS Plot
Rmd	66ba4e0	Steve Ped	2020-02-19	Added OXPHOS plot
html	e0288c6	Steve Ped	2020-02-19	Generated enrichment tables
Rmd	541fbf0	Steve Ped	2020-02-19	Revised Hom Vs Het Enrichment
Rmd	77f167d	Steve Ped	2020-02-19	Revised enrichment
html	876e40f	Steve Ped	2020-02-17	Compiled after minor corrections
Rmd	53ed0e3	Steve Ped	2020-02-17	Corrected Ensembl Release
Rmd	f55be85	Steve Ped	2020-02-17	Added commas
Rmd	8f29458	Steve Ped	2020-02-17	Minor tweaks to formatting
html	9104ecd	Steve Ped	2020-01-28	First draft of Hom Vs Het
Rmd	207cdc8	Steve Ped	2020-01-28	Added code for Hom Vs Het Enrichment
Rmd	468e6e3	Steve Ped	2020-01-28	Ran enrichment of Mut Vs WT
html	468e6e3	Steve Ped	2020-01-28	Ran enrichment of Mut Vs WT
Rmd	3a9933c	Steve Ped	2020-01-28	Finished Enrichment analysis on MutVsWt
html	3a9933c	Steve Ped	2020-01-28	Finished Enrichment analysis on MutVsWt

Setup

library(tidyverse)
library(magrittr)
library(edgeR)
library(scales)
library(pander)
library(goseq)
library(msigdbr)
library(AnnotationDbi)
library(RColorBrewer)
library(ngsReports)
library(UpSetR)
library(pheatmap)

theme_set(theme_bw())
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")
panderOptions("big.mark", ",")

samples <- read_csv("data/samples.csv") %>%
  distinct(sampleName, .keep_all = TRUE) %>%
  dplyr::select(sample = sampleName, sampleID, genotype) %>%
  mutate(
    genotype = factor(genotype, levels = c("WT", "Het", "Hom")),
    mutant = genotype %in% c("Het", "Hom"),
    homozygous = genotype == "Hom"
  )
genoCols <- samples$genotype %>%
  levels() %>%
  length() %>%
  brewer.pal("Set1") %>%
  setNames(levels(samples$genotype))

dgeList <- here::here("data/dgeList.rds") %>% read_rds()
cpmPostNorm <- here::here("data/cpmPostNorm.rds") %>% read_rds()
entrezGenes <- dgeList$genes %>%
  dplyr::filter(!is.na(entrezid)) %>%
  unnest(entrezid) %>%
  dplyr::rename(entrez_gene = entrezid)

deTable <- here::here("output", "psen2VsWT.csv") %>% 
  read_csv() %>%
  mutate(
    entrezid = dgeList$genes$entrezid[gene_id]
  )

formatP <- function(p, m = 0.0001){
  out <- rep("", length(p))
  out[p < m] <- sprintf("%.2e", p[p<m])
  out[p >= m] <- sprintf("%.4f", p[p>=m])
  out
}

Introduction

Enrichment analysis for this dataset present some challenges. Despite normalisation to account for gene length and GC bias, some appeared to still be present in the final results. In addition, the confounding of incomplete rRNA removal with genotype may lead to other distortions in both DE genes and ranking statistics.

Two steps for enrichment analysis will be undertaken.

Testing for enrichment within discrete sets of DE genes as defined in the previous steps
Testing for enrichment using the complete set of results using fry.

Testing for enrichment within discrete gene sets will be performed using goseq as this allows for the incorporation of a single covariate as a predictor of differential expression. GC content, gene length and correlation with rRNA removal can all be supplied as separate covariates.

For enrichment within larger gene lists, fry can take into account inter-gene correlations. Values supplied will be logCPM for each gene/sample after being adjusted for GC and length biases.

Databases used for testing

Data was sourced using the msigdbr package. The initial database used for testing was the Hallmark Gene Sets, with mappings from gene-set to EntrezGene IDs performed by the package authors.

Hallmark Gene Sets

hm <- msigdbr("Danio rerio", category = "H")  %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
hmByGene <- hm %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
hmByID <- hm %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")

Mappings are required from gene to pathway, and Ensembl identifiers were used to map from gene to pathway, based on the mappings in the previously used annotations (Ensembl Release 98). A total of 3,375 Ensembl IDs were mapped to pathways from the hallmark gene sets.

KEGG Gene Sets

kg <- msigdbr("Danio rerio", category = "C2", subcategory = "CP:KEGG")  %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
kgByGene <- kg  %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
kgByID <- kg  %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")

The same mapping process was applied to KEGG gene sets. A total of 3,530 Ensembl IDs were mapped to pathways from the KEGG gene sets.

Gene Ontology Gene Sets

goSummaries <- url("https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS") %>%
  readRDS() %>%
  mutate(
    Term = Term(id),
    gs_name = Term %>% str_to_upper() %>% str_replace_all("[ -]", "_"),
    gs_name = paste0("GO_", gs_name)
    )
minPath <- 3

go <- msigdbr("Danio rerio", category = "C5") %>% 
  left_join(entrezGenes) %>%
  dplyr::filter(!is.na(gene_id)) %>%
  left_join(goSummaries) %>% 
  dplyr::filter(shortest_path >= minPath) %>%
  distinct(gs_name, gene_id, .keep_all = TRUE)
goByGene <- go %>%
  split(f = .$gene_id) %>%
  lapply(extract2, "gs_name")
goByID <- go %>%
  split(f = .$gs_name) %>%
  lapply(extract2, "gene_id")

For analysis of gene-sets from the GO database, gene-sets were restricted to those with 3 or more steps back to the ontology root terms. A total of 10,975 Ensembl IDs were mapped to pathways from restricted database of 8,825 GO gene sets.

gsSizes <- bind_rows(hm, kg, go) %>% 
  dplyr::select(gs_name, gene_symbol, gene_id) %>% 
  chop(c(gene_symbol, gene_id)) %>%
  mutate(
    gs_size = vapply(gene_symbol, length, integer(1)),
    de_id = lapply(
      X = gene_id, 
      FUN = intersect, 
      y = dplyr::filter(deTable, DE)$gene_id
      ),
    de_size = vapply(de_id, length, integer(1))
  )

Enrichment in the DE Gene Set

The first step of analysis using goseq, regardless of the gene-set, is estimation of the probability weight function (PWF) which quantifies the probability of a gene being considered as DE based on a single covariate. As GC content and length should have been accounted for during conditional-quantile normalisation, these were not required for any bias. However, the gene-level correlations with rRNA contamination were instead used a predictor of bias in selection of a gene as being DE.

rawFqc <- list.files(
  path = here::here("data/0_rawData/FastQC/"),
  pattern = "zip",
  full.names = TRUE
) %>%
  FastqcDataList()
gc <- getModule(rawFqc, "Per_sequence_GC") %>%
  group_by(Filename) %>% 
  mutate(Freq = Count / sum(Count)) %>%
  ungroup()
gcDev <- gc %>%
  left_join(getGC(gcTheoretical, "Drerio", "Trans")) %>%
  mutate(
    sample = str_remove(Filename, "_R[12].fastq.gz"),
    resid = Freq - Drerio
  ) %>% 
  left_join(samples) %>%
  group_by(sample) %>%
  summarise(
    ss = sum(resid^2),
    n = n(),
    sd = sqrt(ss / (n - 1))
  )

riboVec <- structure(gcDev$sd, names = gcDev$sample)
riboCors <- cpm(dgeList, log = TRUE) %>%
  apply(1, function(x){
    cor(x, riboVec[names(x)])
  })

Values were calculated as per the previous steps, using the logCPM values for each gene, with the sample-level standard deviations from the theoretical GC distribution being used as a measure of rRNA contamination. For estimation of the probability weight function, squared correlations were used to place negative and positive correlations on the same scale. This accounts for genes which are both negatively & positively biased by the presence of excessive rRNA. Clearly, the confounding of genotype with rRNA means that some genes driven by the genuine biology may be down-weighted under this approach.

riboPwf <- deTable %>%
  mutate(riboCors = riboCors[gene_id]^2) %>%
  dplyr::select(gene_id, DE, riboCors) %>%
  distinct(gene_id, .keep_all = TRUE) %>% 
  with(
    nullp(
      DEgenes = structure(
        as.integer(DE), names = gene_id
      ), 
      genome = "danRer10", 
      id = "ensGene", 
      bias.data = riboCors,
      plot.fit = FALSE
    )
  )
plotPWF(riboPwf, main = "Bias from rRNA proportions")

Using this approach, it was clear that correlation with rRNA proportions significantly biased the probability of a gene being considered as DE.

Version	Author	Date
468e6e3	Steve Ped	2020-01-28

All gene-sets were then tested using this PWF.

Hallmark Gene Sets

hmRiboGoseq <- goseq(riboPwf, gene2cat = hmByGene) %>%
  as_tibble %>%
  dplyr::filter(numDEInCat > 0) %>%
  mutate( 
    adjP = p.adjust(over_represented_pvalue, method = "bonf"),
    FDR = p.adjust(over_represented_pvalue, method = "fdr")
  ) %>%
  dplyr::select(-contains("under")) %>%
  dplyr::rename(
    gs_name = category,
    PValue = over_represented_pvalue
  )

No gene-sets achieved significance in the DE genes with the lowest FDR being 43%

KEGG Gene Sets

kgRiboGoseq <- goseq(riboPwf, gene2cat = kgByGene) %>%
  as_tibble %>%
  dplyr::filter(numDEInCat > 0) %>%
  mutate( 
    adjP = p.adjust(over_represented_pvalue, method = "bonf"),
    FDR = p.adjust(over_represented_pvalue, method = "fdr")
  ) %>%
  dplyr::select(-contains("under"))  %>%
  dplyr::rename(
    gs_name = category,
    PValue = over_represented_pvalue
  )

kgRiboGoseq %>%
  dplyr::slice(1:5) %>%
  mutate(
    p = formatP(PValue),
    adjP = formatP(adjP),
    FDR = formatP(FDR)
  ) %>%
  dplyr::select(
    `Gene Set` = gs_name,
    DE = numDEInCat,
    `Set Size` = numInCat,
    PValue,
    `p~bonf~` = adjP,
    `p~FDR~` = FDR
  ) %>%
  pander(
    justify = "lrrrrr",
    caption = paste(
      "The", nrow(.), "most highly-ranked KEGG pathways.",
      "Bonferroni-adjusted p-values are the most stringent and give high",
      "confidence when below 0.05."
    )
  )

The 5 most highly-ranked KEGG pathways. Bonferroni-adjusted p-values are the most stringent and give high confidence when below 0.05.
Gene Set	DE	Set Size	PValue	p_bonf	p_FDR
KEGG_RIBOSOME	26	80	1.275e-08	1.44e-06	1.44e-06
KEGG_PRIMARY_IMMUNODEFICIENCY	2	14	0.0127	1.0000	0.7176
KEGG_CYSTEINE_AND_METHIONINE_METABOLISM	4	26	0.03353	1.0000	0.7710
KEGG_ASTHMA	1	3	0.04063	1.0000	0.7710
KEGG_RETINOL_METABOLISM	3	24	0.05372	1.0000	0.7710

Notably, the KEGG gene-set for Ribosomal genes was detected as enriched in the set of DE genes, with no other KEGG gene-sets being considered significant.

GO Gene Sets

goRiboGoseq <- goseq(riboPwf, gene2cat = goByGene) %>%
  as_tibble %>%
  dplyr::filter(numDEInCat > 0) %>%
  mutate( 
    adjP = p.adjust(over_represented_pvalue, method = "bonf"),
    FDR = p.adjust(over_represented_pvalue, method = "fdr")
  ) %>%
  dplyr::select(-contains("under")) %>%
  dplyr::rename(
    gs_name = category,
    PValue = over_represented_pvalue
  )

goRiboGoseq %>%
  dplyr::filter(adjP < 0.05) %>%
  mutate(
    p = formatP(PValue),
    adjP = formatP(adjP),
    FDR = formatP(FDR)
  ) %>%
  dplyr::select(
    `Gene Set` = gs_name,
    DE = numDEInCat,
    `Set Size` = numInCat,
    PValue,
    `p~bonf~` = adjP,
    `p~FDR~` = FDR
  ) %>%
  pander(
    justify = "lrrrrr",
    caption = paste(
      "*The", nrow(.), "most highly-ranked GO terms.",
      "Bonferroni-adjusted p-values are the most stringent and give high",
      "confidence when below 0.05, with all terms here reaching this threshold.",
      "However, most terms once again indicate the presence of rRNA.*"
    )
  )

The 14 most highly-ranked GO terms. Bonferroni-adjusted p-values are the most stringent and give high confidence when below 0.05, with all terms here reaching this threshold. However, most terms once again indicate the presence of rRNA.
Gene Set	DE	Set Size	PValue	p_bonf	p_FDR
GO_COTRANSLATIONAL_PROTEIN_TARGETING_TO_MEMBRANE	28	93	1.418e-09	5.48e-06	5.48e-06
GO_CYTOSOLIC_RIBOSOME	26	96	7.457e-09	2.88e-05	1.44e-05
GO_ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM	27	105	2.869e-08	0.0001	3.70e-05
GO_PROTEIN_TARGETING_TO_MEMBRANE	30	170	4.317e-07	0.0017	0.0004
GO_PROTEIN_LOCALIZATION_TO_ENDOPLASMIC_RETICULUM	27	128	6.68e-07	0.0026	0.0005
GO_TRANSLATIONAL_INITIATION	28	176	1.073e-06	0.0041	0.0007
GO_VIRAL_GENE_EXPRESSION	29	172	1.229e-06	0.0048	0.0007
GO_NUCLEAR_TRANSCRIBED_MRNA_CATABOLIC_PROCESS	30	183	1.48e-06	0.0057	0.0007
GO_RIBOSOMAL_SUBUNIT	28	175	3.178e-06	0.0123	0.0014
GO_CYTOSOLIC_LARGE_RIBOSOMAL_SUBUNIT	15	51	3.988e-06	0.0154	0.0015
GO_CYTOSOLIC_PART	28	202	5.721e-06	0.0221	0.0019
GO_SYMPORTER_ACTIVITY	10	92	5.843e-06	0.0226	0.0019
GO_RIBOSOME	29	209	1.072e-05	0.0414	0.0030
GO_ANION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	17	224	1.098e-05	0.0425	0.0030

Enrichment Testing on the Complete Set of Genes

Hallmark Gene Sets

hmFry <- cpmPostNorm %>%
  fry(
    index = hmByID,
    design = dgeList$design,
    contrast = "mutant",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()

hmFry %>%
  dplyr::filter(
    PValue < 0.01
  ) %>%
  left_join(gsSizes) %>%
  mutate(gs_name = str_remove(gs_name, "HALLMARK_")) %>%
  dplyr::select(
    `Hallmark Gene Set` = gs_name,
    `Expressed Genes` = NGenes,
    `DE Genes` = de_size,
    Direction,
    PValue, FDR
    ) %>%
  pander(
    justify = "lrrlrr",
    caption = "Enriched Hallmark Gene Sets, using a threshold of p < 0.01, which corresponds to an FDR of 0.02"
  )

Enriched Hallmark Gene Sets, using a threshold of p < 0.01, which corresponds to an FDR of 0.02
Hallmark Gene Set	Expressed Genes	DE Genes	Direction	PValue	FDR
UV_RESPONSE_UP	139	7	Up	2.432e-06	0.0001216
XENOBIOTIC_METABOLISM	144	11	Up	8.71e-05	0.001463
REACTIVE_OXYGEN_SPECIES_PATHWAY	45	5	Up	0.0001125	0.001463
MTORC1_SIGNALING	191	12	Up	0.0001464	0.001463
FATTY_ACID_METABOLISM	134	6	Up	0.000152	0.001463
GLYCOLYSIS	168	7	Up	0.0001756	0.001463
ESTROGEN_RESPONSE_LATE	144	3	Up	0.0003224	0.0019
BILE_ACID_METABOLISM	81	1	Up	0.0004024	0.0019
ADIPOGENESIS	181	13	Up	0.000417	0.0019
COAGULATION	75	3	Up	0.0004179	0.0019
PEROXISOME	93	2	Up	0.0004181	0.0019
PROTEIN_SECRETION	92	4	Up	0.0006471	0.002696
UNFOLDED_PROTEIN_RESPONSE	113	4	Up	0.0008653	0.003093
HYPOXIA	163	8	Up	0.0008661	0.003093
CHOLESTEROL_HOMEOSTASIS	67	2	Up	0.001571	0.005236
MYC_TARGETS_V1	190	15	Up	0.00206	0.006137
OXIDATIVE_PHOSPHORYLATION	201	16	Up	0.002158	0.006137
IL2_STAT5_SIGNALING	139	3	Up	0.002209	0.006137
NOTCH_SIGNALING	28	0	Up	0.00283	0.007449
EPITHELIAL_MESENCHYMAL_TRANSITION	153	7	Up	0.00311	0.007594
TNFA_SIGNALING_VIA_NFKB	145	7	Up	0.003189	0.007594
ANGIOGENESIS	25	1	Up	0.005327	0.01211
ANDROGEN_RESPONSE	90	2	Up	0.006328	0.01376
PI3K_AKT_MTOR_SIGNALING	96	2	Up	0.007753	0.01615

Exploration of significant Hallmark Gene Sets

hmFry %>% 
  left_join(gsSizes) %>%
  dplyr::filter(de_size >= 5, FDR < 0.05) %>%
  dplyr::select(gs_name, de_id) %>%
  unnest(de_id) %>% 
  mutate(
    gs_name = str_remove(gs_name, "HALLMARK_"),
    gs_name = fct_lump(gs_name, n = 14)
    ) %>% 
  split(f = .$gs_name) %>% 
  lapply(magrittr::extract2, "de_id") %>% 
  fromList() %>% 
  upset(
    nsets = length(.), 
    nintersects = 20,
    order.by = "freq", 
    mb.ratio = c(0.6, 0.4),
    sets.x.label = "Number Of DE Genes"
    )

UpSet plot indicating distribution of DE genes within all significant HALLMARK gene sets. Gene sets were restricted to those with an FDR < 0.05 and at least 5 DE genes. The plot is truncated at the right for simplicity. Most gene sets seem relatively independent of each other with regard to DE genes.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

hmHeat <- hmFry %>% 
  left_join(gsSizes) %>%
  dplyr::filter(de_size >= 12, FDR < 0.05) %>%
  dplyr::select(gs_name, de_id) %>%
  unnest(de_id) %>%
  left_join(dgeList$genes, by = c("de_id" = "gene_id")) %>%
  dplyr::select(gs_name, de_id, gene_name) %>%
  mutate(belongs = TRUE) %>%
  pivot_wider(
    id_cols = c(de_id, gene_name),
    names_from = gs_name,
    values_from = belongs,
    values_fill = list(belongs = FALSE)
  ) %>%
  left_join(
    cpmPostNorm %>%
      as.data.frame() %>%
      rownames_to_column("de_id")
  ) 
hmHeat %>%
  dplyr::select(gene_name, starts_with("Ps")) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  as.matrix() %>%
  pheatmap(
    color = viridis_pal(option = "magma")(100),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    labels_col = hmHeat %>%
      dplyr::select(starts_with("Ps")) %>% 
      colnames() %>% 
      enframe(name = NULL) %>% 
      dplyr::rename(sample = value) %>%
      left_join(dgeList$samples) %>%
      dplyr::select(sample, sampleID) %>%
      with(
        structure(sampleID, names = sample)
      ),
    cutree_rows = 6,
    cutree_cols = 2,
    annotation_names_row = FALSE,
    annotation_row = hmHeat %>%
      dplyr::select(gene_name, starts_with("HALLMARK_")) %>%
      mutate_at(vars(starts_with("HALL")), as.character) %>%
      as.data.frame() %>%
      column_to_rownames("gene_name") %>%
      set_colnames(
        str_remove(colnames(.), "HALLMARK_")
      )
  )

Gene expression patterns for all DE genes in HALLMARK gene sets, containing more than 12 DE genes.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06
7fe1505	Steve Ped	2020-03-05
86ffa07	Steve Ped	2020-02-20

KEGG Gene Sets

kgFry <- cpmPostNorm %>%
  fry(
    index = kgByID,
    design = dgeList$design,
    contrast = "mutant",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()

kgFry %>%
  dplyr::filter(
    PValue < 0.005
  ) %>%
  left_join(gsSizes) %>%
  mutate(gs_name = str_remove(gs_name, "HALLMARK_")) %>%
  dplyr::select(
    `KEGG Gene Set` = gs_name,
    `Expressed Genes` = NGenes,
    `DE Genes` = de_size,
    Direction,
    PValue, FDR
    ) %>%
  pander(
    justify = "lrrlrr",
    caption = "Enriched KEGG Gene Sets, using a threshold p < 0.005, which corresponds to an FDR of 0.02"
  )

Enriched KEGG Gene Sets, using a threshold p < 0.005, which corresponds to an FDR of 0.02
KEGG Gene Set	Expressed Genes	DE Genes	Direction	PValue	FDR
KEGG_GLUTATHIONE_METABOLISM	35	4	Up	1.849e-05	0.003439
KEGG_PRIMARY_IMMUNODEFICIENCY	14	2	Down	9.082e-05	0.004339
KEGG_GLYCINE_SERINE_AND_THREONINE_METABOLISM	24	2	Up	9.583e-05	0.004339
KEGG_ARGININE_AND_PROLINE_METABOLISM	38	2	Up	0.0001024	0.004339
KEGG_LIMONENE_AND_PINENE_DEGRADATION	8	0	Up	0.0001543	0.004339
KEGG_PPAR_SIGNALING_PATHWAY	50	1	Up	0.0001968	0.004339
KEGG_PORPHYRIN_AND_CHLOROPHYLL_METABOLISM	25	1	Up	0.0002012	0.004339
KEGG_RENIN_ANGIOTENSIN_SYSTEM	7	1	Up	0.0002119	0.004339
KEGG_ASCORBATE_AND_ALDARATE_METABOLISM	7	0	Up	0.0002178	0.004339
KEGG_BETA_ALANINE_METABOLISM	18	1	Up	0.0002531	0.004339
KEGG_GLYCEROLIPID_METABOLISM	37	1	Up	0.0002566	0.004339
KEGG_FATTY_ACID_METABOLISM	38	1	Up	0.0003309	0.005033
KEGG_BUTANOATE_METABOLISM	26	0	Up	0.0003885	0.005033
KEGG_AMINOACYL_TRNA_BIOSYNTHESIS	37	0	Up	0.0004055	0.005033
KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY	79	3	Down	0.0004059	0.005033
KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450	24	3	Up	0.0005708	0.006398
KEGG_PYRUVATE_METABOLISM	27	1	Up	0.0006094	0.006398
KEGG_CYSTEINE_AND_METHIONINE_METABOLISM	26	4	Up	0.000627	0.006398
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES	24	1	Up	0.0006632	0.006398
KEGG_EPITHELIAL_CELL_SIGNALING_IN_HELICOBACTER_PYLORI_INFECTION	60	1	Up	0.0007334	0.006398
KEGG_GLYOXYLATE_AND_DICARBOXYLATE_METABOLISM	12	1	Up	0.0007354	0.006398
KEGG_VIBRIO_CHOLERAE_INFECTION	47	2	Up	0.0007567	0.006398
KEGG_SELENOAMINO_ACID_METABOLISM	21	0	Up	0.0008064	0.006521
KEGG_DRUG_METABOLISM_CYTOCHROME_P450	25	3	Up	0.0008583	0.006652
KEGG_LYSINE_DEGRADATION	39	0	Up	0.0009088	0.006761
KEGG_BIOSYNTHESIS_OF_UNSATURATED_FATTY_ACIDS	20	0	Up	0.001016	0.007162
KEGG_NITROGEN_METABOLISM	13	1	Up	0.00104	0.007162
KEGG_B_CELL_RECEPTOR_SIGNALING_PATHWAY	63	3	Down	0.001104	0.007333
KEGG_LEUKOCYTE_TRANSENDOTHELIAL_MIGRATION	85	1	Up	0.001264	0.007686
KEGG_RIBOSOME	80	26	Up	0.001277	0.007686
KEGG_RETINOL_METABOLISM	24	3	Up	0.001281	0.007686
KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_BIOSYNTHESIS	10	0	Up	0.001442	0.008167
KEGG_PURINE_METABOLISM	132	4	Up	0.001465	0.008167
KEGG_PEROXISOME	67	0	Up	0.001493	0.008167
KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION	44	3	Up	0.001753	0.009097
KEGG_PHENYLALANINE_METABOLISM	10	1	Up	0.001762	0.009097
KEGG_VALINE_LEUCINE_AND_ISOLEUCINE_DEGRADATION	40	0	Up	0.001835	0.009097
KEGG_TYROSINE_METABOLISM	27	2	Up	0.001859	0.009097
KEGG_N_GLYCAN_BIOSYNTHESIS	44	1	Up	0.00209	0.009719
KEGG_ALZHEIMERS_DISEASE	140	12	Up	0.002092	0.009719
KEGG_CITRATE_CYCLE_TCA_CYCLE	26	2	Up	0.002142	0.009719
KEGG_OXIDATIVE_PHOSPHORYLATION	119	15	Up	0.002372	0.0104
KEGG_PROPANOATE_METABOLISM	28	1	Up	0.002404	0.0104
KEGG_PROXIMAL_TUBULE_BICARBONATE_RECLAMATION	15	0	Up	0.002582	0.01092
KEGG_PROTEASOME	41	2	Up	0.002706	0.01118
KEGG_GLYCOLYSIS_GLUCONEOGENESIS	39	3	Up	0.00288	0.01164
KEGG_AMINO_SUGAR_AND_NUCLEOTIDE_SUGAR_METABOLISM	36	0	Up	0.003	0.01187
KEGG_STARCH_AND_SUCROSE_METABOLISM	19	1	Up	0.003293	0.01276
KEGG_ABC_TRANSPORTERS	31	0	Up	0.004088	0.01552
KEGG_ARACHIDONIC_ACID_METABOLISM	18	1	Up	0.00421	0.01566
KEGG_GLYCEROPHOSPHOLIPID_METABOLISM	56	1	Up	0.004304	0.0157

Exploration of Significant KEGG Gene Sets

kgFry %>% 
  left_join(gsSizes) %>%
  dplyr::filter(de_size >= 4, FDR < 0.05) %>%
  dplyr::select(gs_name, de_id) %>%
  unnest(de_id) %>% 
  mutate(
    gs_name = str_remove(gs_name, "KEGG_"),
    gs_name = fct_lump(gs_name, n = 14)
    ) %>% 
  split(f = .$gs_name) %>% 
  lapply(magrittr::extract2, "de_id") %>% 
  fromList() %>% 
  upset(
    nsets = length(.), 
    nintersects = 20,
    order.by = "freq", 
    mb.ratio = c(0.6, 0.4),
    sets.x.label = "Number Of DE Genes"
    )

UpSet plot indicating distribution of DE genes within all significant terms from the KEGG gene sets. Gene sets were restricted to those with an FDR < 0.05 and four or more DE genes. There is considerable overlap between DE genes in Oxidative Phosphorylation and both Parkinson’s and Huntington’s Disease, indicating that these terms essentially capture the same expression signal within this dataset.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

riboOx <- kg %>%
  dplyr::filter(
    grepl("OXIDATIVE_PHOS", gs_name) | grepl("RIBOSOME", gs_name),
    gene_id %in% dplyr::filter(deTable, DE)$gene_id
    ) %>%
  mutate(gs_name = str_remove(gs_name, "(KEGG|HALLMARK)_")) %>%
  distinct(gs_name, gene_id) %>%
  left_join(
    cpmPostNorm %>% 
      as.data.frame() %>% 
      rownames_to_column("gene_id") %>%
      as_tibble()
    ) %>%
  pivot_longer(cols = starts_with("Ps"), names_to = "sample", values_to = "CPM") %>%
  left_join(dgeList$samples) %>% 
  left_join(dgeList$genes) %>%
  dplyr::select(gs_name, gene_name, sampleID, genotype, CPM) %>%
  pivot_wider(
    id_cols = c(gs_name, gene_name),
    values_from = CPM,
    names_from = sampleID
  ) 
riboOx %>% 
  dplyr::select(-gs_name) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  as.matrix() %>%
  pheatmap(
    annotation_row = data.frame(
      GeneSet = riboOx$gs_name %>% str_replace("OXI.+", "OXPHOS"), 
      row.names = riboOx$gene_name
      ),
    color = viridis_pal(option = "magma")(100),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    annotation_names_row = FALSE,
    cutree_rows = 3,
    cutree_cols = 2
  )

Given that the largest gene sets within the KEGG results were Oxidative Phosphorylation and Ribosomal gene sets, all DE genes associated with these KEGG terms are displayed. All genes appear to show increased expression in mutant samples.

Version	Author	Date
7fe1505	Steve Ped	2020-03-05
86ffa07	Steve Ped	2020-02-20

GO Gene Sets

goFry <- cpmPostNorm %>%
  fry(
    index = goByID,
    design = dgeList$design,
    contrast = "mutant",
    sort = "directional"
    ) %>%
  rownames_to_column("gs_name") %>%
  as_tibble()

goFry %>%
  left_join(gsSizes) %>%
  left_join(distinct(go, gs_name, shortest_path)) %>%
  dplyr::filter(
    FDR < 0.01, 
    shortest_path > 3, 
    NGenes < 1000,
    de_size >= 5
  ) %>%
  dplyr::select(
    `GO Gene Set` = gs_name,
    `Expressed Genes` = NGenes,
    `DE Genes` = de_size,
    Direction,
    PValue, FDR
    ) %>%
  pander(
    justify = "lrrlrr",
    caption = "Enriched GO Gene Sets, using the thresholds FDR < 0.01, at least 5 DE genes, a set size < 1000 and more than 3 steps back to the ontology root."
  )

Enriched GO Gene Sets, using the thresholds FDR < 0.01, at least 5 DE genes, a set size < 1000 and more than 3 steps back to the ontology root.
GO Gene Set	Expressed Genes	DE Genes	Direction	PValue	FDR
GO_PEPTIDYL_LYSINE_TRIMETHYLATION	42	5	Down	6.726e-06	0.001799
GO_FATTY_ACID_BIOSYNTHETIC_PROCESS	94	6	Up	7.648e-06	0.001875
GO_ORGANIC_ACID_TRANSPORT	235	14	Up	2.818e-05	0.00307
GO_RESPONSE_TO_CORTICOSTEROID	119	5	Up	4.281e-05	0.003698
GO_REGULATION_OF_LIPID_BIOSYNTHETIC_PROCESS	135	6	Up	4.378e-05	0.003698
GO_PEPTIDYL_LYSINE_METHYLATION	108	5	Down	4.386e-05	0.003698
GO_PEPTIDYL_LYSINE_ACETYLATION	142	5	Down	5.004e-05	0.003698
GO_ORGANIC_ACID_BIOSYNTHETIC_PROCESS	193	12	Up	5.119e-05	0.003698
GO_HISTONE_METHYLATION	116	5	Down	6.243e-05	0.00389
GO_RESPONSE_TO_ALCOHOL	177	7	Up	7.22e-05	0.004142
GO_MONOCARBOXYLIC_ACID_BIOSYNTHETIC_PROCESS	123	7	Up	7.229e-05	0.004142
GO_NUCLEAR_BODY	639	21	Down	7.46e-05	0.004169
GO_CARBOHYDRATE_BIOSYNTHETIC_PROCESS	144	5	Up	9.112e-05	0.004356
GO_MONOCARBOXYLIC_ACID_TRANSPORT	108	6	Up	9.591e-05	0.004356
GO_MONOCARBOXYLIC_ACID_METABOLIC_PROCESS	309	10	Up	0.0001007	0.004423
GO_SYMPORTER_ACTIVITY	92	10	Up	0.0001016	0.004437
GO_DRUG_TRANSPORT	149	8	Up	0.0001207	0.004725
GO_LIPID_BIOSYNTHETIC_PROCESS	497	13	Up	0.0001293	0.004725
GO_PROTEIN_METHYLATION	148	8	Down	0.0001373	0.004725
GO_ORGANIC_ACID_TRANSMEMBRANE_TRANSPORT	104	9	Up	0.0001446	0.004798
GO_AMINO_ACID_TRANSMEMBRANE_TRANSPORT	68	5	Up	0.0001456	0.004811
GO_SECONDARY_ACTIVE_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	151	13	Up	0.0001502	0.00482
GO_ANION_TRANSMEMBRANE_TRANSPORT	200	15	Up	0.0001524	0.004854
GO_ORGANIC_ANION_TRANSPORT	336	15	Up	0.0001569	0.004863
GO_ISOPRENOID_METABOLIC_PROCESS	62	5	Up	0.0001576	0.004863
GO_ANION_TRANSPORT	424	21	Up	0.0001619	0.004942
GO_POSITIVE_REGULATION_OF_TRANSCRIPTION_BY_RNA_POLYMERASE_II	880	31	Down	0.0001659	0.004972
GO_SEQUENCE_SPECIFIC_DNA_BINDING	802	29	Down	0.0001694	0.005028
GO_PROTEIN_LOCALIZATION_TO_MEMBRANE	531	36	Up	0.0001734	0.005068
GO_PROTEIN_MATURATION	203	8	Up	0.0001759	0.005123
GO_COENZYME_METABOLIC_PROCESS	191	5	Up	0.0001867	0.005298
GO_PROTEIN_ACETYLATION	166	5	Down	0.0002054	0.005489
GO_REGULATION_OF_PHOSPHATASE_ACTIVITY	145	7	Down	0.0002055	0.005489
GO_REGULATORY_REGION_NUCLEIC_ACID_BINDING	700	24	Down	0.0002062	0.005489
GO_ORGANIC_ANION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	141	11	Up	0.0002077	0.005489
GO_DRUG_TRANSMEMBRANE_TRANSPORT	64	6	Up	0.0002113	0.005518
GO_ESTABLISHMENT_OF_PROTEIN_LOCALIZATION_TO_MEMBRANE	285	30	Up	0.0002607	0.006007
GO_ANION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	224	17	Up	0.0002793	0.006178
GO_REGULATION_OF_DEPHOSPHORYLATION	175	7	Down	0.000292	0.006316
GO_ACTIVATING_TRANSCRIPTION_FACTOR_BINDING	67	5	Down	0.0003098	0.006556
GO_AMINO_ACID_TRANSPORT	117	6	Up	0.0003297	0.006704
GO_NUCLEAR_SPECK	342	11	Down	0.0003474	0.00677
GO_COVALENT_CHROMATIN_MODIFICATION	380	13	Down	0.0003554	0.006861
GO_PROXIMAL_PROMOTER_SEQUENCE_SPECIFIC_DNA_BINDING	402	15	Down	0.0003592	0.006861
GO_SECRETORY_VESICLE	693	25	Up	0.0003643	0.0069
GO_REGULATION_OF_CHROMATIN_ORGANIZATION	157	7	Down	0.0003704	0.006985
GO_NUCLEOPLASM_PART	931	31	Down	0.000377	0.006992
GO_SEQUENCE_SPECIFIC_DOUBLE_STRANDED_DNA_BINDING	648	23	Down	0.0003923	0.00708
GO_PURINE_CONTAINING_COMPOUND_BIOSYNTHETIC_PROCESS	167	6	Up	0.0003928	0.00708
GO_DOUBLE_STRANDED_DNA_BINDING	720	30	Down	0.0004054	0.007185
GO_REGULATION_OF_PEPTIDE_HORMONE_SECRETION	154	9	Up	0.0004067	0.007185
GO_NUCLEOSIDE_PHOSPHATE_BIOSYNTHETIC_PROCESS	220	10	Up	0.0004112	0.007223
GO_AMINO_ACID_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	59	5	Up	0.0004172	0.007248
GO_ROUGH_ENDOPLASMIC_RETICULUM	65	6	Up	0.0004244	0.007297
GO_TUBULIN_BINDING	254	6	Down	0.000427	0.007303
GO_IMPORT_INTO_CELL	522	8	Up	0.0004424	0.007381
GO_REGULATION_OF_MRNA_PROCESSING	117	5	Down	0.0004506	0.007447
GO_MONOVALENT_INORGANIC_CATION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	263	14	Up	0.000468	0.00758
GO_GUANYL_NUCLEOTIDE_BINDING	312	7	Up	0.0004814	0.007715
GO_INORGANIC_ANION_TRANSPORT	116	7	Up	0.0005049	0.007892
GO_RIBOSE_PHOSPHATE_BIOSYNTHETIC_PROCESS	163	7	Up	0.0005158	0.007892
GO_CATION_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	450	19	Up	0.0005415	0.008155
GO_LIPID_CATABOLIC_PROCESS	212	7	Up	0.000557	0.008172
GO_MONOVALENT_INORGANIC_CATION_HOMEOSTASIS	115	5	Up	0.0005709	0.008265
GO_REGULATION_OF_INSULIN_SECRETION	137	9	Up	0.0005907	0.008404
GO_MONOVALENT_INORGANIC_CATION_TRANSPORT	362	18	Up	0.0006178	0.008534
GO_ORGANIC_HYDROXY_COMPOUND_TRANSPORT	160	6	Up	0.0006252	0.008538
GO_MRNA_METABOLIC_PROCESS	449	26	Down	0.0006296	0.008572
GO_ION_TRANSMEMBRANE_TRANSPORT	806	30	Up	0.0006426	0.008593
GO_CARBOHYDRATE_DERIVATIVE_BIOSYNTHETIC_PROCESS	526	17	Up	0.000706	0.009
GO_GTPASE_ACTIVITY	250	5	Up	0.0007373	0.009225
GO_PROTON_TRANSMEMBRANE_TRANSPORT	118	10	Up	0.0007423	0.009265
GO_REGULATION_OF_CELLULAR_KETONE_METABOLIC_PROCESS	105	6	Up	0.000745	0.009273
GO_PROTEIN_PROCESSING	138	6	Up	0.0007627	0.009316
GO_PROTEIN_HOMOOLIGOMERIZATION	240	10	Up	0.0007654	0.009316
GO_ORGANIC_ACID_CATABOLIC_PROCESS	180	5	Up	0.0007686	0.009343
GO_ORGANOPHOSPHATE_BIOSYNTHETIC_PROCESS	448	14	Up	0.0007831	0.009367
GO_HORMONE_TRANSPORT	234	12	Up	0.0007869	0.009394
GO_PEPTIDE_HORMONE_SECRETION	191	9	Up	0.0007989	0.009489
GO_TRANSMEMBRANE_RECEPTOR_PROTEIN_TYROSINE_KINASE_SIGNALING_PATHWAY	557	13	Up	0.0008038	0.009496
GO_ALPHA_AMINO_ACID_METABOLIC_PROCESS	123	6	Up	0.0008582	0.00981
GO_CATION_TRANSPORT	817	29	Up	0.0008836	0.009928
GO_PROTON_TRANSMEMBRANE_TRANSPORTER_ACTIVITY	97	6	Up	0.0008878	0.009928
GO_REGULATION_OF_DNA_REPLICATION	78	6	Down	0.0008891	0.009928

Exploration of Significant GO Gene Sets

goFry %>% 
  left_join(gsSizes) %>%
  left_join(distinct(go, gs_name, shortest_path)) %>%
  mutate(
    propDE = de_size / gs_size
  ) %>%
  dplyr::filter(
    de_size >= 15, 
    propDE > 0.05,
    FDR < 0.02,
    shortest_path > 3
  ) %>%
  dplyr::select(gs_name, de_id) %>%
  unnest(de_id) %>% 
  mutate(
    gs_name = str_remove(gs_name, "GO_"),
    gs_name = fct_lump(gs_name, n = 20)
    ) %>% 
  split(f = .$gs_name) %>% 
  lapply(magrittr::extract2, "de_id") %>% 
  fromList() %>% 
  upset(
    nsets = length(.), 
    nintersects = 20,
    order.by = "freq", 
    mb.ratio = c(0.6, 0.4),
    sets.x.label = "Number Of DE Genes"
    )

UpSet plot indicating distribution of DE genes within significantly enriched terms from the GO gene sets. For this visualisation, GO terms were restricted to those with 15 or more DE genes, where this represented more than 5% of a gene set as being DE, along with an FDR < 0.02 and more than 3 steps back to the ontology root. The 20 largest GO terms satisfying these criteria are shown. The plot is truncated at the right hand side for simplicity. A group of 28 genes is uniquely attributed to the Mitochondrial Envelope, with a further 18 being relatively unique to mRNA Metabolic Process. The next grouping of 15 genes are unique to Regulation Of Nucleobase-Containing Compound Metabolic Process followed by 25 genes, spread across two clusters of terms which largely represent Ribosomal activity. In between these are 13 genes uniquely associated with Anion Transport.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

Genes Associated with the Mitochondrial Envelope

go %>%
  dplyr::filter(
    gs_name == "GO_MITOCHONDRIAL_ENVELOPE",
    gene_id %in% dplyr::filter(deTable, DE)$gene_id
  ) %>%
  dplyr::select(
    gene_id, gene_name
  ) %>%
  left_join(
    cpmPostNorm %>%
      as.data.frame() %>%
      rownames_to_column("gene_id")
  ) %>%
  dplyr::select(gene_name, starts_with("Ps")) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  as.matrix() %>%
  pheatmap(
    color = viridis_pal(option = "magma")(100),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    labels_col = hmHeat %>%
      dplyr::select(starts_with("Ps")) %>% 
      colnames() %>% 
      enframe(name = NULL) %>% 
      dplyr::rename(sample = value) %>%
      left_join(dgeList$samples) %>%
      dplyr::select(sample, sampleID) %>%
      with(
        structure(sampleID, names = sample)
      ),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    cutree_rows = 4,
    cutree_cols = 2
  )

All DE genes associated with the GO term ‘Mitochondrial Envelope’.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

Genes Associated with RNA Metabolic Process

go %>%
  dplyr::filter(
    gs_name == "GO_MRNA_METABOLIC_PROCESS",
    gene_id %in% dplyr::filter(deTable, DE)$gene_id
  ) %>%
  dplyr::select(
    gene_id, gene_name
  ) %>%
  left_join(
    cpmPostNorm %>%
      as.data.frame() %>%
      rownames_to_column("gene_id")
  ) %>%
  dplyr::select(gene_name, starts_with("Ps")) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  as.matrix() %>%
  pheatmap(
    color = viridis_pal(option = "magma")(100),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    labels_col = hmHeat %>%
      dplyr::select(starts_with("Ps")) %>% 
      colnames() %>% 
      enframe(name = NULL) %>% 
      dplyr::rename(sample = value) %>%
      left_join(dgeList$samples) %>%
      dplyr::select(sample, sampleID) %>%
      with(
        structure(sampleID, names = sample)
      ),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    cutree_rows = 4,
    cutree_cols = 2
  )

All DE genes associated with the GO term ‘mRNA Metabolic Process’.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

Genes Associated with Regulation Of Nucleobase-Containing Compound Metabolic Process

go %>%
  dplyr::filter(
    gs_name == "GO_REGULATION_OF_NUCLEOBASE_CONTAINING_COMPOUND_METABOLIC_PROCESS",
    gene_id %in% dplyr::filter(deTable, DE)$gene_id
  ) %>%
  dplyr::select(
    gene_id, gene_name
  ) %>%
  left_join(
    cpmPostNorm %>%
      as.data.frame() %>%
      rownames_to_column("gene_id")
  ) %>%
  dplyr::select(gene_name, starts_with("Ps")) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  as.matrix() %>%
  pheatmap(
    color = viridis_pal(option = "magma")(100),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    labels_col = hmHeat %>%
      dplyr::select(starts_with("Ps")) %>% 
      colnames() %>% 
      enframe(name = NULL) %>% 
      dplyr::rename(sample = value) %>%
      left_join(dgeList$samples) %>%
      dplyr::select(sample, sampleID) %>%
      with(
        structure(sampleID, names = sample)
      ),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    cutree_rows = 4,
    cutree_cols = 2
  )

All DE genes associated with the GO term ‘mRNA Metabolic Process’.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

Genes Associated with Anion Transport

go %>%
  dplyr::filter(
    gs_name == "GO_ANION_TRANSPORT",
    gene_id %in% dplyr::filter(deTable, DE)$gene_id
  ) %>%
  dplyr::select(
    gene_id, gene_name
  ) %>%
  left_join(
    cpmPostNorm %>%
      as.data.frame() %>%
      rownames_to_column("gene_id")
  ) %>%
  dplyr::select(gene_name, starts_with("Ps")) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  as.matrix() %>%
  pheatmap(
    color = viridis_pal(option = "magma")(100),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    labels_col = hmHeat %>%
      dplyr::select(starts_with("Ps")) %>% 
      colnames() %>% 
      enframe(name = NULL) %>% 
      dplyr::rename(sample = value) %>%
      left_join(dgeList$samples) %>%
      dplyr::select(sample, sampleID) %>%
      with(
        structure(sampleID, names = sample)
      ),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    cutree_rows = 3,
    cutree_cols = 2
  )

All DE genes associated with the GO term ‘Anion Transport’.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

Genes Associated with Ribosomal Activity

go %>%
  dplyr::filter(
    gs_name == "GO_CYTOSOLIC_RIBOSOME",
    gene_id %in% dplyr::filter(deTable, DE)$gene_id
  ) %>%
  dplyr::select(
    gene_id, gene_name
  ) %>%
  left_join(
    cpmPostNorm %>%
      as.data.frame() %>%
      rownames_to_column("gene_id")
  ) %>%
  dplyr::select(gene_name, starts_with("Ps")) %>%
  as.data.frame() %>%
  column_to_rownames("gene_name") %>%
  as.matrix() %>%
  pheatmap(
    color = viridis_pal(option = "magma")(100),
    legend_breaks = c(seq(-2, 8, by = 2), max(.)),
    legend_labels = c(seq(-2, 8, by = 2), "logCPM\n"),
    labels_col = hmHeat %>%
      dplyr::select(starts_with("Ps")) %>% 
      colnames() %>% 
      enframe(name = NULL) %>% 
      dplyr::rename(sample = value) %>%
      left_join(dgeList$samples) %>%
      dplyr::select(sample, sampleID) %>%
      with(
        structure(sampleID, names = sample)
      ),
    annotation_col = dgeList$samples %>%
      dplyr::select(Genotype = genotype),
    cutree_rows = 3,
    cutree_cols = 2
  )

All DE genes associated with the GO term ‘Cytosolic Ribosome’.

Version	Author	Date
a5cdf5d	Steve Ped	2020-03-06

Data Export

All enriched gene sets with an FDR adjusted p-value < 0.05 were exported as a single csv file.

bind_rows(
  hmFry,
  kgFry,
  goFry
) %>%
  dplyr::filter(FDR < 0.05) %>%
  left_join(gsSizes) %>%
  dplyr::select(
    gs_name, NGenes, Direction, PValue, FDR,
    gene_symbol, de_size
  ) %>%
  mutate(
    DE = lapply(
      X = gene_symbol,
      FUN =  intersect, 
      y = dplyr::filter(deTable, DE)$gene_name
    ),
    DE = vapply(DE, paste, character(1), collapse = ";")
  ) %>%
  dplyr::select(
    `Gene Set` = gs_name, 
    `Nbr Detected Genes` = NGenes, 
    `Nbr DE Genes` = de_size, 
    Direction, PValue, FDR, 
    `DE Genes` = DE
  ) %>%
  write_csv(
    here::here("output", "Enrichment_Mutant_V_WT.csv")
  )

devtools::session_info()

─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.3 (2020-02-29)
 os       Ubuntu 18.04.4 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language en_AU:en                    
 collate  en_AU.UTF-8                 
 ctype    en_AU.UTF-8                 
 tz       Australia/Adelaide          
 date     2020-04-02                  

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version  date       lib source        
 AnnotationDbi        * 1.48.0   2019-10-29 [2] Bioconductor  
 askpass                1.1      2019-01-13 [2] CRAN (R 3.6.0)
 assertthat             0.2.1    2019-03-21 [2] CRAN (R 3.6.0)
 backports              1.1.5    2019-10-02 [2] CRAN (R 3.6.1)
 BiasedUrn            * 1.07     2015-12-28 [2] CRAN (R 3.6.1)
 Biobase              * 2.46.0   2019-10-29 [2] Bioconductor  
 BiocFileCache          1.10.2   2019-11-08 [2] Bioconductor  
 BiocGenerics         * 0.32.0   2019-10-29 [2] Bioconductor  
 BiocParallel           1.20.1   2019-12-21 [2] Bioconductor  
 biomaRt                2.42.0   2019-10-29 [2] Bioconductor  
 Biostrings             2.54.0   2019-10-29 [2] Bioconductor  
 bit                    1.1-15.2 2020-02-10 [2] CRAN (R 3.6.2)
 bit64                  0.9-7    2017-05-08 [2] CRAN (R 3.6.0)
 bitops                 1.0-6    2013-08-17 [2] CRAN (R 3.6.0)
 blob                   1.2.1    2020-01-20 [2] CRAN (R 3.6.2)
 broom                  0.5.4    2020-01-27 [2] CRAN (R 3.6.2)
 callr                  3.4.2    2020-02-12 [2] CRAN (R 3.6.2)
 cellranger             1.1.0    2016-07-27 [2] CRAN (R 3.6.0)
 cli                    2.0.1    2020-01-08 [2] CRAN (R 3.6.2)
 cluster                2.1.0    2019-06-19 [2] CRAN (R 3.6.1)
 colorspace             1.4-1    2019-03-18 [2] CRAN (R 3.6.0)
 crayon                 1.3.4    2017-09-16 [2] CRAN (R 3.6.0)
 curl                   4.3      2019-12-02 [2] CRAN (R 3.6.2)
 data.table             1.12.8   2019-12-09 [2] CRAN (R 3.6.2)
 DBI                    1.1.0    2019-12-15 [2] CRAN (R 3.6.2)
 dbplyr                 1.4.2    2019-06-17 [2] CRAN (R 3.6.0)
 DelayedArray           0.12.2   2020-01-06 [2] Bioconductor  
 desc                   1.2.0    2018-05-01 [2] CRAN (R 3.6.0)
 devtools               2.2.2    2020-02-17 [2] CRAN (R 3.6.2)
 digest                 0.6.25   2020-02-23 [2] CRAN (R 3.6.2)
 dplyr                * 0.8.4    2020-01-31 [2] CRAN (R 3.6.2)
 edgeR                * 3.28.1   2020-02-26 [2] Bioconductor  
 ellipsis               0.3.0    2019-09-20 [2] CRAN (R 3.6.1)
 evaluate               0.14     2019-05-28 [2] CRAN (R 3.6.0)
 FactoMineR             2.2      2020-02-05 [2] CRAN (R 3.6.2)
 fansi                  0.4.1    2020-01-08 [2] CRAN (R 3.6.2)
 farver                 2.0.3    2020-01-16 [2] CRAN (R 3.6.2)
 flashClust             1.01-2   2012-08-21 [2] CRAN (R 3.6.1)
 forcats              * 0.4.0    2019-02-17 [2] CRAN (R 3.6.0)
 fs                     1.3.1    2019-05-06 [2] CRAN (R 3.6.0)
 geneLenDataBase      * 1.22.0   2019-11-05 [2] Bioconductor  
 generics               0.0.2    2018-11-29 [2] CRAN (R 3.6.0)
 GenomeInfoDb           1.22.0   2019-10-29 [2] Bioconductor  
 GenomeInfoDbData       1.2.2    2019-11-21 [2] Bioconductor  
 GenomicAlignments      1.22.1   2019-11-12 [2] Bioconductor  
 GenomicFeatures        1.38.2   2020-02-15 [2] Bioconductor  
 GenomicRanges          1.38.0   2019-10-29 [2] Bioconductor  
 ggdendro               0.1-20   2016-04-27 [2] CRAN (R 3.6.0)
 ggplot2              * 3.2.1    2019-08-10 [2] CRAN (R 3.6.1)
 ggrepel                0.8.1    2019-05-07 [2] CRAN (R 3.6.0)
 git2r                  0.26.1   2019-06-29 [2] CRAN (R 3.6.1)
 glue                   1.3.1    2019-03-12 [2] CRAN (R 3.6.0)
 GO.db                  3.10.0   2019-11-21 [2] Bioconductor  
 goseq                * 1.38.0   2019-10-29 [2] Bioconductor  
 gridExtra              2.3      2017-09-09 [2] CRAN (R 3.6.0)
 gtable                 0.3.0    2019-03-25 [2] CRAN (R 3.6.0)
 haven                  2.2.0    2019-11-08 [2] CRAN (R 3.6.1)
 here                   0.1      2017-05-28 [2] CRAN (R 3.6.0)
 highr                  0.8      2019-03-20 [2] CRAN (R 3.6.0)
 hms                    0.5.3    2020-01-08 [2] CRAN (R 3.6.2)
 htmltools              0.4.0    2019-10-04 [2] CRAN (R 3.6.1)
 htmlwidgets            1.5.1    2019-10-08 [2] CRAN (R 3.6.1)
 httpuv                 1.5.2    2019-09-11 [2] CRAN (R 3.6.1)
 httr                   1.4.1    2019-08-05 [2] CRAN (R 3.6.1)
 hwriter                1.3.2    2014-09-10 [2] CRAN (R 3.6.0)
 IRanges              * 2.20.2   2020-01-13 [2] Bioconductor  
 jpeg                   0.1-8.1  2019-10-24 [2] CRAN (R 3.6.1)
 jsonlite               1.6.1    2020-02-02 [2] CRAN (R 3.6.2)
 kableExtra             1.1.0    2019-03-16 [2] CRAN (R 3.6.1)
 knitr                  1.28     2020-02-06 [2] CRAN (R 3.6.2)
 labeling               0.3      2014-08-23 [2] CRAN (R 3.6.0)
 later                  1.0.0    2019-10-04 [2] CRAN (R 3.6.1)
 lattice                0.20-40  2020-02-19 [4] CRAN (R 3.6.2)
 latticeExtra           0.6-29   2019-12-19 [2] CRAN (R 3.6.2)
 lazyeval               0.2.2    2019-03-15 [2] CRAN (R 3.6.0)
 leaps                  3.1      2020-01-16 [2] CRAN (R 3.6.2)
 lifecycle              0.1.0    2019-08-01 [2] CRAN (R 3.6.1)
 limma                * 3.42.2   2020-02-03 [2] Bioconductor  
 locfit                 1.5-9.1  2013-04-20 [2] CRAN (R 3.6.0)
 lubridate              1.7.4    2018-04-11 [2] CRAN (R 3.6.0)
 magrittr             * 1.5      2014-11-22 [2] CRAN (R 3.6.0)
 MASS                   7.3-51.5 2019-12-20 [4] CRAN (R 3.6.2)
 Matrix                 1.2-18   2019-11-27 [2] CRAN (R 3.6.1)
 matrixStats            0.55.0   2019-09-07 [2] CRAN (R 3.6.1)
 memoise                1.1.0    2017-04-21 [2] CRAN (R 3.6.0)
 mgcv                   1.8-31   2019-11-09 [4] CRAN (R 3.6.1)
 modelr                 0.1.6    2020-02-22 [2] CRAN (R 3.6.2)
 msigdbr              * 7.0.1    2019-09-04 [2] CRAN (R 3.6.2)
 munsell                0.5.0    2018-06-12 [2] CRAN (R 3.6.0)
 ngsReports           * 1.1.2    2019-10-16 [1] Bioconductor  
 nlme                   3.1-144  2020-02-06 [4] CRAN (R 3.6.2)
 openssl                1.4.1    2019-07-18 [2] CRAN (R 3.6.1)
 pander               * 0.6.3    2018-11-06 [2] CRAN (R 3.6.0)
 pheatmap             * 1.0.12   2019-01-04 [2] CRAN (R 3.6.0)
 pillar                 1.4.3    2019-12-20 [2] CRAN (R 3.6.2)
 pkgbuild               1.0.6    2019-10-09 [2] CRAN (R 3.6.1)
 pkgconfig              2.0.3    2019-09-22 [2] CRAN (R 3.6.1)
 pkgload                1.0.2    2018-10-29 [2] CRAN (R 3.6.0)
 plotly                 4.9.2    2020-02-12 [2] CRAN (R 3.6.2)
 plyr                   1.8.5    2019-12-10 [2] CRAN (R 3.6.2)
 png                    0.1-7    2013-12-03 [2] CRAN (R 3.6.0)
 prettyunits            1.1.1    2020-01-24 [2] CRAN (R 3.6.2)
 processx               3.4.2    2020-02-09 [2] CRAN (R 3.6.2)
 progress               1.2.2    2019-05-16 [2] CRAN (R 3.6.0)
 promises               1.1.0    2019-10-04 [2] CRAN (R 3.6.1)
 ps                     1.3.2    2020-02-13 [2] CRAN (R 3.6.2)
 purrr                * 0.3.3    2019-10-18 [2] CRAN (R 3.6.1)
 R6                     2.4.1    2019-11-12 [2] CRAN (R 3.6.1)
 rappdirs               0.3.1    2016-03-28 [2] CRAN (R 3.6.0)
 RColorBrewer         * 1.1-2    2014-12-07 [2] CRAN (R 3.6.0)
 Rcpp                   1.0.3    2019-11-08 [2] CRAN (R 3.6.1)
 RCurl                  1.98-1.1 2020-01-19 [2] CRAN (R 3.6.2)
 readr                * 1.3.1    2018-12-21 [2] CRAN (R 3.6.0)
 readxl                 1.3.1    2019-03-13 [2] CRAN (R 3.6.0)
 remotes                2.1.1    2020-02-15 [2] CRAN (R 3.6.2)
 reprex                 0.3.0    2019-05-16 [2] CRAN (R 3.6.0)
 reshape2               1.4.3    2017-12-11 [2] CRAN (R 3.6.0)
 rlang                  0.4.4    2020-01-28 [2] CRAN (R 3.6.2)
 rmarkdown              2.1      2020-01-20 [2] CRAN (R 3.6.2)
 rprojroot              1.3-2    2018-01-03 [2] CRAN (R 3.6.0)
 Rsamtools              2.2.3    2020-02-23 [2] Bioconductor  
 RSQLite                2.2.0    2020-01-07 [2] CRAN (R 3.6.2)
 rstudioapi             0.11     2020-02-07 [2] CRAN (R 3.6.2)
 rtracklayer            1.46.0   2019-10-29 [2] Bioconductor  
 rvest                  0.3.5    2019-11-08 [2] CRAN (R 3.6.1)
 S4Vectors            * 0.24.3   2020-01-18 [2] Bioconductor  
 scales               * 1.1.0    2019-11-18 [2] CRAN (R 3.6.1)
 scatterplot3d          0.3-41   2018-03-14 [2] CRAN (R 3.6.1)
 sessioninfo            1.1.1    2018-11-05 [2] CRAN (R 3.6.0)
 ShortRead              1.44.3   2020-02-03 [2] Bioconductor  
 statmod                1.4.34   2020-02-17 [2] CRAN (R 3.6.2)
 stringi                1.4.6    2020-02-17 [2] CRAN (R 3.6.2)
 stringr              * 1.4.0    2019-02-10 [2] CRAN (R 3.6.0)
 SummarizedExperiment   1.16.1   2019-12-19 [2] Bioconductor  
 testthat               2.3.1    2019-12-01 [2] CRAN (R 3.6.1)
 tibble               * 2.1.3    2019-06-06 [2] CRAN (R 3.6.0)
 tidyr                * 1.0.2    2020-01-24 [2] CRAN (R 3.6.2)
 tidyselect             1.0.0    2020-01-27 [2] CRAN (R 3.6.2)
 tidyverse            * 1.3.0    2019-11-21 [2] CRAN (R 3.6.1)
 truncnorm              1.0-8    2018-02-27 [2] CRAN (R 3.6.0)
 UpSetR               * 1.4.0    2019-05-22 [2] CRAN (R 3.6.0)
 usethis                1.5.1    2019-07-04 [2] CRAN (R 3.6.1)
 vctrs                  0.2.3    2020-02-20 [2] CRAN (R 3.6.2)
 viridisLite            0.3.0    2018-02-01 [2] CRAN (R 3.6.0)
 webshot                0.5.2    2019-11-22 [2] CRAN (R 3.6.1)
 whisker                0.4      2019-08-28 [2] CRAN (R 3.6.1)
 withr                  2.1.2    2018-03-15 [2] CRAN (R 3.6.0)
 workflowr            * 1.6.0    2019-12-19 [2] CRAN (R 3.6.2)
 xfun                   0.12     2020-01-13 [2] CRAN (R 3.6.2)
 XML                    3.99-0.3 2020-01-20 [2] CRAN (R 3.6.2)
 xml2                   1.2.2    2019-08-09 [2] CRAN (R 3.6.1)
 XVector                0.26.0   2019-10-29 [2] Bioconductor  
 yaml                   2.2.1    2020-02-01 [2] CRAN (R 3.6.2)
 zlibbioc               1.32.0   2019-10-29 [2] Bioconductor  
 zoo                    1.8-7    2020-01-10 [2] CRAN (R 3.6.2)

[1] /home/steveped/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

Enrichment Analysis: Mutant Vs Wild-Type

Steve Pederson

02 April, 2020

Setup

Introduction

Databases used for testing

Hallmark Gene Sets

KEGG Gene Sets

Gene Ontology Gene Sets

Enrichment in the DE Gene Set

Hallmark Gene Sets

KEGG Gene Sets

GO Gene Sets

Enrichment Testing on the Complete Set of Genes

Hallmark Gene Sets

Exploration of significant Hallmark Gene Sets

KEGG Gene Sets

Exploration of Significant KEGG Gene Sets

GO Gene Sets

Exploration of Significant GO Gene Sets

Genes Associated with the Mitochondrial Envelope

Genes Associated with RNA Metabolic Process

Genes Associated with Regulation Of Nucleobase-Containing Compound Metabolic Process

Genes Associated with Anion Transport

Genes Associated with Ribosomal Activity

Data Export