library(pander)
library(magrittr)
library(dplyr)
library(scales)
library(ggplot2)
library(ggrepel)
library(parallel)
library(readr)
library(reshape2)
logit <- binomial()$linkfun
inv.logit <- binomial()$linkinv
w <- 169/25.4
h <- 169/25.4

Introduction

This analysis takes the filtered SNPs from under analysis and simulates genetic drift under no selective pressure, in order to compare detected changes in allele frequencies to the range of values predicted by the drift model.

Outline of Drift Simulation

This analysis uses a custom set of scripts collected as an R package and available from https://github.com/steveped/driftSim. This is able to be installed as below.

devtools::install_github("steveped/driftSim")
library(driftSim)

Initialisation

nGen <- 16
nSim <- 5000
litterSize <- 20
migRate <- 0.15
alpha <- 0.001
Ne <- c(`1996` = 222, `2012` = 116)
pSurv <- 0.1
f0 <- seq(0.5, 0.95, by = 0.05)
nPops <- c(4, 8)
sigma <- c(0.5, 0.8)

As decided the population parameters were defined as:

Parameter Value Comment
Ne1996 222 Effective Population Size in 1996
Ne2012 116 Effective Population Size in 2012
p 0.1 Probability of survival for each generation
g 16 The number of generations between 1996 and 2012
n 4, 8 The number of neighbouring populations
l 20 The annual litter size
r 0.15 The migration rate
f0 0.5 to 0.95 The starting allele frequency in the main population, increased in steps of 0.05
σ 0.5, 0.8 The variability around f0 for neighbouring populations

No selective advantage was specified for any allele.

The simulation parameters were defined as:

Parameter Value Comment
nsim 5000 The number of simulations at each starting frequency f0
α 0.001 The significance level for plotting confidence bands

To summarise the above:

  1. The effective population size was defined as Ne = 222 representing the initial population in 1996.
  2. A starting major allele frequency was selected as one of f0 = 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9 and 0.95. Each rabbit was simulated as heterozygous or homozygous for either the major or minor allele using the initial starting frquency
  3. An initial survival rate was defined as p = 0.1, with this functioning as an initial bottleneck, and rabbits were assigned as survivors or fatalities with this probability.
  4. This population was considered as the central population, and this initialisation process was then repeated for either 4 or 8 neighbouring populations of the same size. However, variability was added to the initial allele frequencies on the logit scale using values of σ = 0.5 and 0.8. The same bottleneck procedure was applied to each of these populations.

Placing Variability in Context

nSamp <- 1e06
x_0 <- runif(nSamp, 0.5, 0.95)
y_0 <- sigma %>%
  lapply(function(sigma){
    inv.logit(rnorm(nSamp, logit(x_0), sigma))
  })
The effect of adding variability to derive initial frequencies in neighbouring populations. Half of the simulated initial values will be contained by the box for each starting frequency, with the remaining half being outside of these bounds.

The effect of adding variability to derive initial frequencies in neighbouring populations. Half of the simulated initial values will be contained by the box for each starting frequency, with the remaining half being outside of these bounds.

In order to express these values for σ in terms of correlations, 1,000,000 random values were simulated for a starting frequency (f0) anywhere between 0.5 and 0.95. 1,000,000 neighbouring population starting values were randomly sampled around this using the values of σ = 0.5 and 0.8, with the random sampling taking place on the logit scale.

Approximate correlations between starting allele frequencies in two populations, for the chosen values of \(\sigma\)
Sigma Correlation
0.5 0.81
0.8 0.65

Run Simulations

allParam <- expand.grid(f0 = f0,
                        mig = migRate,
                        n = nPops,
                        sd = sigma) %>%
  mutate(N0 = round(Ne["1996"], 0),
         Nt = round(Ne["2012"], 0),
         t = nGen,
         surv = pSurv,
         litter = litterSize) %>%
  split(f = factor(1:nrow(.)))
allSim <- allParam %>%
  mclapply(function(x){
    x <- as.list(x)
    replicate(nSim, do.call(simDrift, args = x)$ft)
  },
  mc.cores = 20)

Inspection of Simulations

As expected the mean allele frequency after simulation of drift remained approximately equal to the initial starting frquency, however the variability of the final frequency was heavily affected by the number of neighbouring populations and their similarity to the to main population of interest.

Mean and standard devation of allele frequencies ($f_t$) after simulating drift. Polynomial lines of best fit are shown for each value.

Mean and standard devation of allele frequencies (\(f_t\)) after simulating drift. Polynomial lines of best fit are shown for each value.

  • More populations surrounding the main population tended to hold allele frequencies steady over time
  • More varibility between populations resulted in wider distributions, i.e. more variable allele frequencies, after allowing for drift
  • The Gum Creek / Oraparinna population is not an isolated population, but will likely experience migration from a continuous larger population. The variability within this larger popaulation is unknown.
  • The most conservative approach to apply in analysis would be to assume a small number of neighbouring populations, which are highly genetically divergent, giving the greates room for variable drift. Observed changes in allele frequencies beyond extreme simulated values would then be more likely to contain an element of selective pressure.
allSim99 <- allSim %>%
  lapply(quantile, probs = c(alpha/2, 1-alpha/2)) %>%
  lapply(as.list) %>%
  lapply(as.data.frame) %>%
  bind_rows() %>%
  set_names(c("lwr", "upr"))
driftIntervals <- allParam %>%
  bind_rows() %>%
  dplyr::select(f0, n, sd) %>%
  as_data_frame() %>%
  cbind(allSim99)
Expected 99.9% intervals for individual allele frequencies under each set of criteria used in generation of simulations.

Expected 99.9% intervals for individual allele frequencies under each set of criteria used in generation of simulations.

Comparison With Observed Data

flkResults <- file.path("..", "results", "flkResults.tsv") %>%
  read_tsv
genotypeResults <- file.path("..", "results", "genotypeResults.tsv") %>%
  read_tsv
sigSnps <- c(filter(flkResults, FDR < 0.05)$snpID,
             filter(genotypeResults, FDR < 0.1)$snpID) %>%
  unique
## png 
##   2
Allele frequencies in the two populations. SNPs considered as significant under either the Full Genotype (FDR = 0.1) or FLK (FDR = 0.05) models are highlighted in green. Select SNPs considered significant under either model are labelled. 99.9% prediction intervals are indicated by the bands for differing simulation parameters. Intervals generated by 4 neighbouring populations are shown in red, whilst intervals generated by 8 neighbouring populations are shown in blue. Neigbouring populations with the closest similarity to the main population are shown with solid lines, whilst those with less similarity are shown as dashed lines. The two points which clearly fall within all bands were found significant  under the full genotype model and show differencec in heterozygosity rather than in the allele frequencies themselves. The inner blue bands represent the expected scenario which most closely resembles the ecological system under investigation, with a continuous population of rabbits (i.e. a large number of neighbouring populations), with highly correlated allele frequencies.

Allele frequencies in the two populations. SNPs considered as significant under either the Full Genotype (FDR = 0.1) or FLK (FDR = 0.05) models are highlighted in green. Select SNPs considered significant under either model are labelled. 99.9% prediction intervals are indicated by the bands for differing simulation parameters. Intervals generated by 4 neighbouring populations are shown in red, whilst intervals generated by 8 neighbouring populations are shown in blue. Neigbouring populations with the closest similarity to the main population are shown with solid lines, whilst those with less similarity are shown as dashed lines. The two points which clearly fall within all bands were found significant under the full genotype model and show differencec in heterozygosity rather than in the allele frequencies themselves. The inner blue bands represent the expected scenario which most closely resembles the ecological system under investigation, with a continuous population of rabbits (i.e. a large number of neighbouring populations), with highly correlated allele frequencies.

save.image("../temp/05_GeneticDrift.RData")