Introduction
- Outline of selection process for SNPs to be analysed
Data Setup
SNP Filtering And Selection
Correction of P & Q Nucleotides
Editing of genpop data

library(tidyverse)
library(magrittr)
library(rtracklayer)
library(GenomicRanges)
library(parallel)
library(reshape2)
library(pander)
library(scales)
library(stringr)
nCores <- min(12, detectCores() - 1)

Introduction

This file takes the raw output from Stacks and filters the complete set of SNPs to arrive at the final list of SNPs used for downstream analysis. The two populations are referred to below as Population 1 (Gum Creek, 1996) and Population 2 (Oraparinna, 2012).

Outline of selection process for SNPs to be analysed

The filtering steps involved are:

Removal of SNPs located on sex chromosomes
Removal of SNPs within the PstI restriction site errors
Removal of fixed alleles
Removal of SNPs not observed in 1996 as these would be likely due to migration not selection
Removal of SNPs identified in <75% of both populations
Removal of SNPs within the same GBS tag, with identical allele frequencies
Removal of SNPs called by Stacks to be in separate tags, but which share identical genomic co-ordinates
Selection of SNPs within 40kb of a gene

Stacks identifies P (major) and Q (minor) alleles within each population, and an additional processing step was included to ensure the P allele in both populations was the SNP determined to be the P allele in the 1996 population.

Data Setup

Genome Setup

Whilst aligned to the RefSeq/NCBI genome build, for compatibility with Ensembl gene models, a mapping object was generated from the RefSeq Assembly Report

ncbiReport <- file.path("..", "data", "GCF_000003625.3_OryCun2.0_assembly_report.txt") %>%
  read_delim(delim = "\t", col_names = FALSE, na = "na", comment = "#") %>%
  set_colnames(c("Sequence-Name", "Sequence-Role", "Assigned-Molecule", "Assigned-Molecule-Location/Type", "GenBank-Accn", "Relationship", "RefSeq-Accn", "Assembly-Unit", "Sequence-Length", "UCSC-style-name")) %>%
  filter(!is.na(`RefSeq-Accn`))
ensUsesGB <- grepl("scaffold",ncbiReport$`Sequence-Role`)
ncbiReport$Ensembl <- ncbiReport$`Sequence-Name`
ncbiReport$Ensembl[ensUsesGB] <- gsub("\\.[0-9]$", "", ncbiReport$`GenBank-Accn`[ensUsesGB])
ncbi2Ens <- with(ncbiReport, data.frame(Chr = Ensembl, 
                                        Length = `Sequence-Length`,
                                        row.names = `RefSeq-Accn`))

Gene information was obtained from Ensembl Build 84 and loaded

ensGff <- file.path("..", "data", "Oryctolagus_cuniculus.OryCun2.0.84.gff3.gz")
ensGenes <- import.gff3(ensGff, feature.type = "gene", sequenceRegionsAsSeqinfo = TRUE)

Loading SNP Data

sumData <- file.path("..", "data", "batch_3.sumstats.tsv.gz") %>%
  gzfile() %>%
  read_delim(delim = "\t", col_names = TRUE, skip = 3) %>%
  dplyr::select(-contains("Batch"), -contains("Pi"), -contains("Fis")) %>%
  mutate(RefSeq = gsub("^gi\\|.+\\|ref\\|(.+)\\|$", "\\1", Chr),
         Chr = as.character(ncbi2Ens[RefSeq, "Chr"]),
         snpID = paste(`Locus ID`, Col, sep="_"),
         Private = as.logical(Private)) %>%
  filter(`Pop ID` != 3,
         Chr != "X",
         Col > 4) %>%
  dplyr::select(Chr, BP, `Locus ID`, Col, snpID, 
                `Pop ID`, contains("Nuc"),  N, P, 
                contains("Obs"), contains("Exp"), Private)

The stacks summary output was loaded immediately removing:

SNPs on Chromosome X
SNPs within the restriction site
Data from the Turretfield outgroup, which was not required for this stage of the analysis

This object contained information for 155,455 SNPs across 65,676 GBS tags.

Conversion of SNPs to GRanges

The genomic locations of all SNPs were then prepared as a GRanges object.

snp2GR <- sumData %>%
  distinct(snpID, .keep_all = TRUE) %>%
  makeGRangesFromDataFrame(keep.extra.columns=TRUE,
                           seqinfo = seqinfo(ensGenes),
                           seqnames.field = "Chr",
                           start.field = "BP",
                           end.field = "BP",
                           ignore.strand = TRUE) %>%
  extract(, c("Locus ID", "snpID")) %>%
  set_names(.$snpID) %>%
  sort

SNP Filtering And Selection

Uninformative SNPs

Fixed SNPs were defined as P > 0.95 in both populations or P = 0.5 in both populations

fixedSNPs <- sumData %>%
  group_by(snpID) %>%
  summarise(minP = min(P),
            maxP = max(P),
            both50 = all(minP == 0.5, maxP == 0.5)) %>%
  filter(minP > 0.95 | both50) %>%
  extract2("snpID")

SNPs not detected in the 1996 population were identified as these were likely due to migration or some other factor, not selection

uniq2Ora <- filter(sumData, `Pop ID` == 1, P == 1)$snpID

SNPs were also excluded if not identified in > 75% of both populations

minSamp <- sumData %>% 
  group_by(`Pop ID`) %>% 
  summarise(N = max(N)) %>%
  mutate(minN = ceiling(0.75*N)) %>%
  dplyr::select(`Pop ID`, N, minN)

Minimum sample sizes required in each population.
Pop ID	N	Min Required
1	55	42
2	53	40

lowSamp <- sumData %>%
  left_join(dplyr::select(minSamp, -N)) %>%
  mutate(keep = N >= minN) %>%
  group_by(snpID) %>%
  summarise(keep = sum(keep) == 2) %>%
  filter(!keep) %>%
  extract2("snpID")

These three filtering steps identified SNPs for exclusion as seen in the following table:

SNPs identified for removal
Reason For Removal	SNPs Marked For Removal
Fixed Alleles	30,336
Not Found in 1996	9,153
Identified in < 75%	77,280
Total	94,679

Duplicate SNPs

bestInTag <- sumData %>% 
  filter(!snpID %in% c(fixedSNPs, uniq2Ora, lowSamp)) %>%
  dplyr::select(`Locus ID`, Col, snpID, `Pop ID`, N) %>%
  dcast(`Locus ID` + Col + snpID ~ `Pop ID`) %>% 
  distinct(`Locus ID`, `1`, `2`, .keep_all = TRUE) %>%
  extract2("snpID")

Where multiple SNPs were present in a tag with identical allele frequencies and sample sizes, one was selected at random
This marked 18,509 SNPs as containing duplicated information

bestDuplicate <- sumData %>% 
  filter(snpID %in% bestInTag) %>%
  group_by(Chr, BP, snpID) %>%
  summarise(N = sum(N)) %>%
  ungroup() %>%
  mutate(`Locus ID` = gsub("([0-9]+)_[0-9]","\\1", snpID)) %>%
  arrange(Chr, BP, N, `Locus ID`) %>%
  distinct(Chr, BP, .keep_all = TRUE) %>%
  extract2("snpID")

Some SNPs were identified at the same genomic location in two loci and these were assumed to be due to differing cut sites, or tags aligned to complementary strands
If samples sizes were identical, otherwise the SNP with the largest total samples was chosen
This gave 31,386 candidate SNPs for downstream analysis.

Selection of SNPs Near Genes

finalSNPs <- snp2GR %>%
  subset(snpID %in% bestDuplicate) %>%
  resize(80001, fix = "center") %>%
  trim() %>%
  subsetByOverlaps(ensGenes) %>%
  names()

nGenes <- snp2GR %>%
  extract(finalSNPs) %>%
  resize(80001, fix = "center") %>%
  trim() %>%
  findOverlaps(ensGenes) %>%
  subjectHits() %>%
  unique() %>%
  length()

Only SNPs within 40kb of a known gene were then retained, giving the final total of 20,336 SNPs for downstream analysis, which potentially provided markers for 10,078 genes.

Genomic Coverage After SNP Filtering

winSize <- 1e05
windows <- tileGenome(seqinfo(snp2GR)[as.character(1:21),], tilewidth = 1e05)
windowCounts <- countOverlaps(windows, subset(snp2GR, snpID %in% finalSNPs))

Using 100kb tiled windows across the autosomes, this final list of SNPs gave 16827 windows without a SNP, and 4534 windows with at least one SNP. This means that 21.2% of windows were sampled to a high enough quality using the GBS approach. Of the windows with a SNP, the median number of SNPs in each window was 2, with the most densely sampled window containing 31 SNPs. This is in comparison to the dataset before filtering of low-quality/duplicate SNPs, in which 55% of the 100kb windows contained at least one SNP.

Correction of P & Q Nucleotides

P and Q alleles are defined separately for each population, and need to be redefined such that these alleles are common across populations. Define each SNP such that the P (i.e. most common) allele in Population 1 is considered as the reference allele.

sumData %<>%
  filter(snpID %in% finalSNPs) %>%
  split(f = .$snpID) %>%
  mclapply(function(x){
    if (length(unique(x$`P Nuc`)) == 2){ # If there are 2 P alleles
      x[2, c("P Nuc", "Q Nuc")] <- x[2, c("Q Nuc", "P Nuc")]
      x$P[2] <- 1 - x$P[2]
    }
    x
  }, mc.cores = nCores) %>%
  bind_rows

gzOut <- file.path("..", "data", "filteredSNPs.tsv.gz") %>%
  gzfile("w")
sumData %>%
  dplyr::select(-Private) %>%
  write_tsv(gzOut)
close(gzOut)

This final object was exported as the object filteredSNPs.tsv.gz

Editing of genpop data

The genpop file as output by stacks was loaded into R and converted to genotypes manually.

First extract all the setup information

gpFile <- file.path("..", "data", "batch_3.genepop.gz") %>% gzfile()
gpData <- readLines(gpFile)
close(gpFile)
popStart <- grep("^pop", gpData) + 1
popEnd <- c(popStart[-1] - 2, length(gpData))
nPops <- length(popStart)
sampleRows <- seq_along(popStart) %>%
  sapply(function(x){seq(popStart[x], popEnd[x])}) %>%
  unlist()
sampleIDs <- gsub("(gc|ora|TF|tf|Y|pt)([0-9ABC]+),.+", "\\1\\2", gpData[sampleRows])
snpIDs <- gpData[2] %>% str_split(",") %>% extract2(1)

Now get the alleles using the existing allele encoding
Then restrict to the SNPs being analysed

alleles <- gsub("(gc|ora|TF|tf|Y|pt)[0-9ABC]+,\\t", "", gpData[sampleRows]) %>%
  str_split_fixed("\t", n = length(snpIDs)) %>%
  replace(. == "0000", NA) %>%
  set_colnames(snpIDs) %>%
  set_rownames(sampleIDs) %>%
  extract(,finalSNPs)

Convert to minor allele counts

minorAlleleCounts <- alleles %>%
  apply(MARGIN = 2, function(x){
    allNums <- sort(unique(x[!is.na(x)])) # The existing allele combinations
    allIDs <- unique(c(substr(allNums, 1, 2),
                       substr(allNums, 3, 4))) # The existing allele numbers
    # Create all possible genotypes
    gTypes <- c(paste0(allIDs[1], allIDs[1]),
                  paste0(allIDs[1], allIDs[2]),
                  paste0(allIDs[2], allIDs[2]))
    # Find the maximum allele based on homozygotes across all populations
    homs <- vapply(gTypes[c(1, 3)],function(y){sum(grepl(y, x), na.rm = TRUE)}, integer(1))
    P <- which.max(homs)
    # Reverse the genotypes & use as factor levels
    if (P == 2) gTypes <- rev(gTypes)
    as.integer(factor(x, levels = gTypes)) - 1
  }) %>%
  set_rownames(sampleIDs)

gzOut <- file.path("..", "data", "minorAlleleCounts.tsv.gz") %>%
  gzfile("w")
minorAlleleCounts %>%
  as.data.frame() %>%
  rownames_to_column("sampleID") %>%
  write_tsv(gzOut)
close(gzOut)

This file was then exported as minorAlleleCounts.tsv.gz. An additional genepop file was created for use with Bayescan and for estimation of effective population sizes.

customGenepopFile <- file.path("..", "data", "filteredSNPs.genepop.gz") 
# The header
write_lines(x = "Created by custom script; Genepop version 4.1.3; Nov 05, 2017", 
           path = gzfile(customGenepopFile))
# Allele names
colnames(alleles) %>% 
  paste(collapse = ",") %>%
  write_lines(gzfile(customGenepopFile), append = TRUE)
# The first population
write_lines("pop", gzfile(customGenepopFile), append = TRUE)
# Allele data for each gc sample
grep("gc", rownames(alleles), value = TRUE) %>%
  lapply(function(x){
    alls <- replace(alleles[x,], is.na(alleles[x,]), "0000") %>%
      paste(collapse = "\t")
    paste(x, alls, sep = ",") %>%
      write_lines(gzfile(customGenepopFile), append = TRUE)
  })
# The next population
write_lines("pop", gzfile(customGenepopFile), append = TRUE)
# Allele data for each ora sample
grep("ora", rownames(alleles), value = TRUE) %>%
  lapply(function(x){
    alls <- replace(alleles[x,], is.na(alleles[x,]), "0000") %>%
      paste(collapse = "\t")
    paste(x, alls, sep = ",") %>%
      write_lines(gzfile(customGenepopFile), append = TRUE)
  })

R version 3.4.3 (2017-11-30)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C

attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: bindrcpp(v.0.2), scales(v.0.5.0), pander(v.0.6.1), reshape2(v.1.4.3), rtracklayer(v.1.38.3), GenomicRanges(v.1.30.1), GenomeInfoDb(v.1.14.0), IRanges(v.2.12.0), S4Vectors(v.0.16.0), BiocGenerics(v.0.24.0), magrittr(v.1.5), forcats(v.0.2.0), stringr(v.1.2.0), dplyr(v.0.7.4), purrr(v.0.2.4), readr(v.1.1.1), tidyr(v.0.7.2), tibble(v.1.4.2), ggplot2(v.2.2.1) and tidyverse(v.1.2.1)

loaded via a namespace (and not attached): Rcpp(v.0.12.15), lubridate(v.1.7.1), lattice(v.0.20-35), Rsamtools(v.1.30.0), Biostrings(v.2.46.0), assertthat(v.0.2.0), rprojroot(v.1.3-2), digest(v.0.6.14), psych(v.1.7.8), R6(v.2.2.2), cellranger(v.1.1.0), plyr(v.1.8.4), backports(v.1.1.2), evaluate(v.0.10.1), httr(v.1.3.1), pillar(v.1.1.0), zlibbioc(v.1.24.0), rlang(v.0.1.6), lazyeval(v.0.2.1), readxl(v.1.0.0), rstudioapi(v.0.7), Matrix(v.1.2-12), rmarkdown(v.1.8), BiocParallel(v.1.12.0), foreign(v.0.8-69), RCurl(v.1.95-4.10), munsell(v.0.4.3), DelayedArray(v.0.4.1), broom(v.0.4.3), compiler(v.3.4.3), modelr(v.0.1.1), pkgconfig(v.2.0.1), mnormt(v.1.5-5), htmltools(v.0.3.6), SummarizedExperiment(v.1.8.1), GenomeInfoDbData(v.1.0.0), matrixStats(v.0.53.0), XML(v.3.98-1.9), crayon(v.1.3.4), GenomicAlignments(v.1.14.1), bitops(v.1.0-6), grid(v.3.4.3), nlme(v.3.1-131), jsonlite(v.1.5), gtable(v.0.2.0), cli(v.1.0.0), stringi(v.1.1.6), XVector(v.0.18.0), xml2(v.1.2.0), tools(v.3.4.3), Biobase(v.2.38.0), glue(v.1.2.0), hms(v.0.4.1), yaml(v.2.1.16), colorspace(v.1.3-2), rvest(v.0.3.2), knitr(v.1.18), bindr(v.0.1) and haven(v.1.1.1)

SNP Filtering

Steve Pederson

29 January, 2018