21 July 2016
myMean()Our previous approach was highly effective.
?sum
Note that this function has an na.rm argument already
What does the ... mean?
In R the ellipsis (...) is used to pass unspecified arguments down to other functions
myMean<- function(x, ...){
total <- sum(x, ...)
n <- sum(!is.na(x))
mn <- total/n
mn
}
If we now call myMean():
testVec <- c(NA, 1:10) myMean(testVec, na.rm = TRUE)
## [1] 5.5
It works!!!
We need to specify these arguments by name
If calling a function with an ... argument:
... is the first argument?plot
In Day_2/data is a file snps.csv
library(readr)
library(dplyr)
library(reshape2)
snps <- read_csv("data/snps.csv")
dim(snps)
## [1] 104 2001
Here we have genotypes coded as AA,AB or BB for 2000 SNP alleles
How many samples and populations do we have?
Here we have genotypes coded as AA,AB or BB for 2000 SNP alleles
snps %>% group_by(Population) %>% summarise(n = n())
## # A tibble: 2 x 2 ## Population n ## <chr> <int> ## 1 Control 53 ## 2 Treat 51
Write a function that performs a Fisher's Exact Test on each allele, comparing the allele structure across populations
For each SNP, we need to:
A and B)The best place to start is just working with a single SNP
What information do we need?
fisherFun<- function(snp, pop){
}
group_by() then summarise() acast()snps[1:2] %>% group_by(Population, SNP1) %>% summarise(count = n()) %>% filter(!is.na(SNP1)) %>% acast(Population~SNP1, value.var = "count")
## AA AB BB ## Control 13 30 10 ## Treat 8 19 21
Forming a table of genotypes:
table()table(snps$Population, snps$SNP1)
## ## AA AB BB ## Control 13 30 10 ## Treat 8 19 21
fisherFun<- function(snp, pop){
genoTable <- table(pop, snp)
return(genoTable)
}
fisherFun(snps$SNP1, snps$Population)
## snp ## pop AA AB BB ## Control 13 30 10 ## Treat 8 19 21
fisherFun<- function(snp, pop){
genoTable <- table(pop, snp)
alleleTable <- data.frame(A = 2*genoTable[,"AA"] + genoTable[,"AB"],
B = 2*genoTable[,"BB"] + genoTable[,"AB"])
return(alleleTable)
}
fisherFun(snps$SNP1, snps$Population)
## A B ## Control 56 50 ## Treat 35 61
fisherFun<- function(snp, pop){
genoTable <- table(pop, snp)
alleleTable <- data.frame(A = 2*genoTable[,"AA"] + genoTable[,"AB"],
B = 2*genoTable[,"BB"] + genoTable[,"AB"])
fTest <- fisher.test(alleleTable)
return(fTest)
}
fisherFun(snps$SNP1, snps$Population)
Is there any optional information we might like to pass to the Fisher Test?
?fisher.test
fisherFun<- function(snp, pop, ...){
genoTable <- table(pop, snp)
alleleTable <- data.frame(A = 2*genoTable[,"AA"] + genoTable[,"AB"],
B = 2*genoTable[,"BB"] + genoTable[,"AB"])
fTest <- fisher.test(alleleTable, ...)
return(fTest)
}
Compare the following
fisherFun(snps$SNP1, snps$Population) fisherFun(snps$SNP1, snps$Population, conf.level = 0.99)
Was there a difference?
Should we leave the ellipsis here?
What information should we keep?
Do we want:
A or B in each population?What information should we keep?
A or B in each population?Let's go with just A in each population and the \(p\)-value
fisherFun<- function(snp, pop, ...){
genoTable <- table(pop, snp)
alleleTable <- data.frame(A = 2*genoTable[,"AA"] + genoTable[,"AB"],
B = 2*genoTable[,"BB"] + genoTable[,"AB"])
freqs <- alleleTable[,"A"] / rowSums(alleleTable)
fTest <- fisher.test(alleleTable, ...)
return(freqs)
}
fisherFun<- function(snp, pop, ...){
genoTable <- table(pop, snp)
alleleTable <- data.frame(A = 2*genoTable[,"AA"] + genoTable[,"AB"],
B = 2*genoTable[,"BB"] + genoTable[,"AB"])
freqs <- alleleTable[,"A"] / rowSums(alleleTable)
fTest <- fisher.test(alleleTable, ...)
out <- as.list(freqs)
out$p.value <- fTest$p.value
as.data.frame(out)
}
fisherFun(snps$SNP1, snps$Population)
## Control Treat p.value ## 1 0.5283019 0.3645833 0.02359114
lapplyNow we have a function which works on one SNP
R has a very handy function lapply()
?lapply
lapplyThis stands for list apply()
listdata.frame is a listlapplyA simple example:
x <- list(int = 1:10, norm = rnorm(100))
Here we have made a list x with two components int and norm
lapply(x, mean)
lapplyWe could place any sensible function here
lapply(x, head) lapply(x, min) lapply(x, typeof)
lapplyNote the ...
lapply() places each element of the list (i.e. X) as the first argument of FUNmapply() for that if requiredlapply()In all the above, the output was a list the same length as the input.
sapply() attempts to simplify the outputsapply(x, mean) sapply(x, range)
lapply()In all the above, the output was a list the same length as the input.
vapply() requires the output structure be definedvapply(x, mean, numeric(1)) vapply(x, range, numeric(2))
lapply()The function apply() can be applied to matrices/arrays
?apply
MARGIN = 1 applies the function by rowMARGIN = 2 applies the function by columnMARGIN = 3 for 3-d arrays etc.Our data is a data.frame (i.e. a list)
\(\implies\) can use lapply()
vapply(snps, anyNA, logical(1))
We could even chain together functions
snps %>% lapply(is.na) %>% sapply(mean)
fisherFun()Our output is a data.frame
Should we use lapply() or sapply()?
fisherFun()Let's do a quick test
lapply(snps[2:3], fisherFun, pop = snps$Population) sapply(snps[2:3], fisherFun, pop = snps$Population)
fisherFun()One dplyr function we haven't seen yet: bind_rows()
snps[2:3] %>% lapply(fisherFun, pop = snps$Population) %>% bind_rows()
We can use this to combine all our results!
fisherFun()Here's my actual analysis
results <- snps[-1] %>%
lapply(fisherFun, pop = snps$Population) %>%
bind_rows() %>%
mutate(SNP = names(snps)[-1],
adjP = p.adjust(p.value, method = "bonferroni")) %>%
arrange(p.value) %>%
select(SNP, everything())
fisherFun()head(results)
## SNP Control Treat p.value adjP ## 1 SNP1716 0.6132075 0.0900000 5.995601e-16 1.199120e-12 ## 2 SNP1236 0.6320755 0.1078431 1.234748e-15 2.469497e-12 ## 3 SNP603 0.3846154 0.8921569 8.474637e-15 1.694927e-11 ## 4 SNP1271 0.4230769 0.9183673 1.357034e-14 2.714068e-11 ## 5 SNP1501 0.3301887 0.8431373 2.380582e-14 4.761164e-11 ## 6 SNP248 0.5849057 0.1000000 6.181343e-14 1.236269e-10
The package parallel has a function mclapply()
multicore lapply()library(parallel)
results <- snps[-1] %>%
mclapply(fisherFun, pop = snps$Population, mc.cores = 3) %>%
bind_rows() %>%
mutate(SNP = names(snps)[-1],
adjP = p.adjust(p.value, method = "bonferroni")) %>%
arrange(p.value) %>%
select(SNP, everything())
Check the times (using my dual-core i7 laptop)
times <- list(
series = system.time(lapply(snps[-1],
fisherFun,
pop = snps$Population)),
parallel = system.time(mclapply(snps[-1],
fisherFun,
pop = snps$Population,
mc.cores = 3))
)
With 3 cores: a speed up of 1.74-fold
Now imagine a set of 25,000 SNPs with pair-wise comparisons required for linkage
mclapply():
mc.cores identical R sessionsThe package snow allows us to configure each node manually
When writing functions, the order of arguments can be important
lapply()%>% places the output of a function into the first position of the next function