library(tidyverse)
library(magrittr)
library(rtracklayer)
library(GenomicRanges)
library(parallel)
library(reshape2)
library(pander)
library(scales)
library(stringr)
nCores <- min(12, detectCores() - 1)
This file takes the raw output from Stacks
and filters the complete set of SNPs to arrive at the final list of SNPs used for downstream analysis. The two populations are referred to below as Population 1
(Gum Creek, 1996) and Population 2
(Oraparinna, 2012).
The filtering steps involved are:
Stacks
to be in separate tags, but which share identical genomic co-ordinatesStacks
identifies P
(major) and Q
(minor) alleles within each population, and an additional processing step was included to ensure the P
allele in both populations was the SNP determined to be the P
allele in the 1996 population.
Whilst aligned to the RefSeq/NCBI genome build, for compatibility with Ensembl gene models, a mapping object was generated from the RefSeq Assembly Report
ncbiReport <- file.path("..", "data", "GCF_000003625.3_OryCun2.0_assembly_report.txt") %>%
read_delim(delim = "\t", col_names = FALSE, na = "na", comment = "#") %>%
set_colnames(c("Sequence-Name", "Sequence-Role", "Assigned-Molecule", "Assigned-Molecule-Location/Type", "GenBank-Accn", "Relationship", "RefSeq-Accn", "Assembly-Unit", "Sequence-Length", "UCSC-style-name")) %>%
filter(!is.na(`RefSeq-Accn`))
ensUsesGB <- grepl("scaffold",ncbiReport$`Sequence-Role`)
ncbiReport$Ensembl <- ncbiReport$`Sequence-Name`
ncbiReport$Ensembl[ensUsesGB] <- gsub("\\.[0-9]$", "", ncbiReport$`GenBank-Accn`[ensUsesGB])
ncbi2Ens <- with(ncbiReport, data.frame(Chr = Ensembl,
Length = `Sequence-Length`,
row.names = `RefSeq-Accn`))
Gene information was obtained from Ensembl Build 84 and loaded
ensGff <- file.path("..", "data", "Oryctolagus_cuniculus.OryCun2.0.84.gff3.gz")
ensGenes <- import.gff3(ensGff, feature.type = "gene", sequenceRegionsAsSeqinfo = TRUE)
sumData <- file.path("..", "data", "batch_3.sumstats.tsv.gz") %>%
gzfile() %>%
read_delim(delim = "\t", col_names = TRUE, skip = 3) %>%
dplyr::select(-contains("Batch"), -contains("Pi"), -contains("Fis")) %>%
mutate(RefSeq = gsub("^gi\\|.+\\|ref\\|(.+)\\|$", "\\1", Chr),
Chr = as.character(ncbi2Ens[RefSeq, "Chr"]),
snpID = paste(`Locus ID`, Col, sep="_"),
Private = as.logical(Private)) %>%
filter(`Pop ID` != 3,
Chr != "X",
Col > 4) %>%
dplyr::select(Chr, BP, `Locus ID`, Col, snpID,
`Pop ID`, contains("Nuc"), N, P,
contains("Obs"), contains("Exp"), Private)
The stacks
summary output was loaded immediately removing:
This object contained information for 155,455 SNPs across 65,676 GBS tags.
The genomic locations of all SNPs were then prepared as a GRanges
object.
snp2GR <- sumData %>%
distinct(snpID, .keep_all = TRUE) %>%
makeGRangesFromDataFrame(keep.extra.columns=TRUE,
seqinfo = seqinfo(ensGenes),
seqnames.field = "Chr",
start.field = "BP",
end.field = "BP",
ignore.strand = TRUE) %>%
extract(, c("Locus ID", "snpID")) %>%
set_names(.$snpID) %>%
sort
P > 0.95
in both populations or P = 0.5
in both populationsfixedSNPs <- sumData %>%
group_by(snpID) %>%
summarise(minP = min(P),
maxP = max(P),
both50 = all(minP == 0.5, maxP == 0.5)) %>%
filter(minP > 0.95 | both50) %>%
extract2("snpID")
uniq2Ora <- filter(sumData, `Pop ID` == 1, P == 1)$snpID
minSamp <- sumData %>%
group_by(`Pop ID`) %>%
summarise(N = max(N)) %>%
mutate(minN = ceiling(0.75*N)) %>%
dplyr::select(`Pop ID`, N, minN)
Pop ID | N | Min Required |
---|---|---|
1 | 55 | 42 |
2 | 53 | 40 |
lowSamp <- sumData %>%
left_join(dplyr::select(minSamp, -N)) %>%
mutate(keep = N >= minN) %>%
group_by(snpID) %>%
summarise(keep = sum(keep) == 2) %>%
filter(!keep) %>%
extract2("snpID")
These three filtering steps identified SNPs for exclusion as seen in the following table:
Reason For Removal | SNPs Marked For Removal |
---|---|
Fixed Alleles | 30,336 |
Not Found in 1996 | 9,153 |
Identified in < 75% | 77,280 |
Total | 94,679 |
bestInTag <- sumData %>%
filter(!snpID %in% c(fixedSNPs, uniq2Ora, lowSamp)) %>%
dplyr::select(`Locus ID`, Col, snpID, `Pop ID`, N) %>%
dcast(`Locus ID` + Col + snpID ~ `Pop ID`) %>%
distinct(`Locus ID`, `1`, `2`, .keep_all = TRUE) %>%
extract2("snpID")
bestDuplicate <- sumData %>%
filter(snpID %in% bestInTag) %>%
group_by(Chr, BP, snpID) %>%
summarise(N = sum(N)) %>%
ungroup() %>%
mutate(`Locus ID` = gsub("([0-9]+)_[0-9]","\\1", snpID)) %>%
arrange(Chr, BP, N, `Locus ID`) %>%
distinct(Chr, BP, .keep_all = TRUE) %>%
extract2("snpID")
finalSNPs <- snp2GR %>%
subset(snpID %in% bestDuplicate) %>%
resize(80001, fix = "center") %>%
trim() %>%
subsetByOverlaps(ensGenes) %>%
names()
nGenes <- snp2GR %>%
extract(finalSNPs) %>%
resize(80001, fix = "center") %>%
trim() %>%
findOverlaps(ensGenes) %>%
subjectHits() %>%
unique() %>%
length()
Only SNPs within 40kb of a known gene were then retained, giving the final total of 20,336 SNPs for downstream analysis, which potentially provided markers for 10,078 genes.
winSize <- 1e05
windows <- tileGenome(seqinfo(snp2GR)[as.character(1:21),], tilewidth = 1e05)
windowCounts <- countOverlaps(windows, subset(snp2GR, snpID %in% finalSNPs))
Using 100kb tiled windows across the autosomes, this final list of SNPs gave 16827 windows without a SNP, and 4534 windows with at least one SNP. This means that 21.2% of windows were sampled to a high enough quality using the GBS approach. Of the windows with a SNP, the median number of SNPs in each window was 2, with the most densely sampled window containing 31 SNPs. This is in comparison to the dataset before filtering of low-quality/duplicate SNPs, in which 55% of the 100kb windows contained at least one SNP.
P and Q alleles are defined separately for each population, and need to be redefined such that these alleles are common across populations. Define each SNP such that the P (i.e. most common) allele in Population 1 is considered as the reference allele.
sumData %<>%
filter(snpID %in% finalSNPs) %>%
split(f = .$snpID) %>%
mclapply(function(x){
if (length(unique(x$`P Nuc`)) == 2){ # If there are 2 P alleles
x[2, c("P Nuc", "Q Nuc")] <- x[2, c("Q Nuc", "P Nuc")]
x$P[2] <- 1 - x$P[2]
}
x
}, mc.cores = nCores) %>%
bind_rows
gzOut <- file.path("..", "data", "filteredSNPs.tsv.gz") %>%
gzfile("w")
sumData %>%
dplyr::select(-Private) %>%
write_tsv(gzOut)
close(gzOut)
This final object was exported as the object filteredSNPs.tsv.gz
The genpop
file as output by stacks
was loaded into R
and converted to genotypes manually.
gpFile <- file.path("..", "data", "batch_3.genepop.gz") %>% gzfile()
gpData <- readLines(gpFile)
close(gpFile)
popStart <- grep("^pop", gpData) + 1
popEnd <- c(popStart[-1] - 2, length(gpData))
nPops <- length(popStart)
sampleRows <- seq_along(popStart) %>%
sapply(function(x){seq(popStart[x], popEnd[x])}) %>%
unlist()
sampleIDs <- gsub("(gc|ora|TF|tf|Y|pt)([0-9ABC]+),.+", "\\1\\2", gpData[sampleRows])
snpIDs <- gpData[2] %>% str_split(",") %>% extract2(1)
alleles <- gsub("(gc|ora|TF|tf|Y|pt)[0-9ABC]+,\\t", "", gpData[sampleRows]) %>%
str_split_fixed("\t", n = length(snpIDs)) %>%
replace(. == "0000", NA) %>%
set_colnames(snpIDs) %>%
set_rownames(sampleIDs) %>%
extract(,finalSNPs)
minorAlleleCounts <- alleles %>%
apply(MARGIN = 2, function(x){
allNums <- sort(unique(x[!is.na(x)])) # The existing allele combinations
allIDs <- unique(c(substr(allNums, 1, 2),
substr(allNums, 3, 4))) # The existing allele numbers
# Create all possible genotypes
gTypes <- c(paste0(allIDs[1], allIDs[1]),
paste0(allIDs[1], allIDs[2]),
paste0(allIDs[2], allIDs[2]))
# Find the maximum allele based on homozygotes across all populations
homs <- vapply(gTypes[c(1, 3)],function(y){sum(grepl(y, x), na.rm = TRUE)}, integer(1))
P <- which.max(homs)
# Reverse the genotypes & use as factor levels
if (P == 2) gTypes <- rev(gTypes)
as.integer(factor(x, levels = gTypes)) - 1
}) %>%
set_rownames(sampleIDs)
gzOut <- file.path("..", "data", "minorAlleleCounts.tsv.gz") %>%
gzfile("w")
minorAlleleCounts %>%
as.data.frame() %>%
rownames_to_column("sampleID") %>%
write_tsv(gzOut)
close(gzOut)
This file was then exported as minorAlleleCounts.tsv.gz
. An additional genepop file was created for use with Bayescan and for estimation of effective population sizes.
customGenepopFile <- file.path("..", "data", "filteredSNPs.genepop.gz")
# The header
write_lines(x = "Created by custom script; Genepop version 4.1.3; Nov 05, 2017",
path = gzfile(customGenepopFile))
# Allele names
colnames(alleles) %>%
paste(collapse = ",") %>%
write_lines(gzfile(customGenepopFile), append = TRUE)
# The first population
write_lines("pop", gzfile(customGenepopFile), append = TRUE)
# Allele data for each gc sample
grep("gc", rownames(alleles), value = TRUE) %>%
lapply(function(x){
alls <- replace(alleles[x,], is.na(alleles[x,]), "0000") %>%
paste(collapse = "\t")
paste(x, alls, sep = ",") %>%
write_lines(gzfile(customGenepopFile), append = TRUE)
})
# The next population
write_lines("pop", gzfile(customGenepopFile), append = TRUE)
# Allele data for each ora sample
grep("ora", rownames(alleles), value = TRUE) %>%
lapply(function(x){
alls <- replace(alleles[x,], is.na(alleles[x,]), "0000") %>%
paste(collapse = "\t")
paste(x, alls, sep = ",") %>%
write_lines(gzfile(customGenepopFile), append = TRUE)
})
R version 3.4.3 (2017-11-30)
**Platform:** x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: bindrcpp(v.0.2), scales(v.0.5.0), pander(v.0.6.1), reshape2(v.1.4.3), rtracklayer(v.1.38.3), GenomicRanges(v.1.30.1), GenomeInfoDb(v.1.14.0), IRanges(v.2.12.0), S4Vectors(v.0.16.0), BiocGenerics(v.0.24.0), magrittr(v.1.5), forcats(v.0.2.0), stringr(v.1.2.0), dplyr(v.0.7.4), purrr(v.0.2.4), readr(v.1.1.1), tidyr(v.0.7.2), tibble(v.1.4.2), ggplot2(v.2.2.1) and tidyverse(v.1.2.1)
loaded via a namespace (and not attached): Rcpp(v.0.12.15), lubridate(v.1.7.1), lattice(v.0.20-35), Rsamtools(v.1.30.0), Biostrings(v.2.46.0), assertthat(v.0.2.0), rprojroot(v.1.3-2), digest(v.0.6.14), psych(v.1.7.8), R6(v.2.2.2), cellranger(v.1.1.0), plyr(v.1.8.4), backports(v.1.1.2), evaluate(v.0.10.1), httr(v.1.3.1), pillar(v.1.1.0), zlibbioc(v.1.24.0), rlang(v.0.1.6), lazyeval(v.0.2.1), readxl(v.1.0.0), rstudioapi(v.0.7), Matrix(v.1.2-12), rmarkdown(v.1.8), BiocParallel(v.1.12.0), foreign(v.0.8-69), RCurl(v.1.95-4.10), munsell(v.0.4.3), DelayedArray(v.0.4.1), broom(v.0.4.3), compiler(v.3.4.3), modelr(v.0.1.1), pkgconfig(v.2.0.1), mnormt(v.1.5-5), htmltools(v.0.3.6), SummarizedExperiment(v.1.8.1), GenomeInfoDbData(v.1.0.0), matrixStats(v.0.53.0), XML(v.3.98-1.9), crayon(v.1.3.4), GenomicAlignments(v.1.14.1), bitops(v.1.0-6), grid(v.3.4.3), nlme(v.3.1-131), jsonlite(v.1.5), gtable(v.0.2.0), cli(v.1.0.0), stringi(v.1.1.6), XVector(v.0.18.0), xml2(v.1.2.0), tools(v.3.4.3), Biobase(v.2.38.0), glue(v.1.2.0), hms(v.0.4.1), yaml(v.2.1.16), colorspace(v.1.3-2), rvest(v.0.3.2), knitr(v.1.18), bindr(v.0.1) and haven(v.1.1.1)