Hypothesis Testing
Multiple Testing Considerations
12 August 2019
In biological research we often ask: is there a real difference between our experimental groups?
We might be comparing: expression levels between two cell types, allele frequencies between populations, or responses to a treatment.
How do we decide if our experimental results are “significant”?
Every experiment is considered as a random sample from all possible repeated experiments.
Most experiments involve measuring something: e.g. mRNA expression levels, allele frequencies, or the proportion of cells responding to a treatment.
Many data collections can also be considered as experimental datasets:
In the 1000 Genomes Project a risk allele for T1D has a frequency of \(\pi = 0.07\) in European populations.
In our in vitro experiment, we found that 90% of HeLa cells were lysed by exposure to our drug.
All population parameters are considered to be fixed values, e.g. the population mean (\(\mu\)), the population variance (\(\sigma^2\)) or an allele frequency (\(\pi\)).
All classical statistical testing involves: a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_A\)).
Why do we do this? Because if \(H_0\) is true we know exactly what our data should look like, so we can judge how surprising our results are.
An experimental hypothesis may be
\[ H_0: \mu = 0 \quad \text{vs} \quad H_A: \mu \neq 0 \]
Where \(\mu\) represents the true average difference in a value (e.g. mRNA expression levels)
For every experiment we conduct we can get two key values:
1: The sample mean (\(\bar{x}\)) estimates the population-level mean (e.g. \(\mu\))
\[ \text{For} \quad \mathbf{x} = (x_1, x_2, ..., x_n) \\ \bar{x} = \frac{1}{n}\sum_{i = 1}^n x_i \]
This will be a different value every time we repeat the experiment
This is an estimate of the true effect
2: The sample variance (\(s^2\)) estimates the population-level variance (\(\sigma^2\))
\[ s^2 = \frac{1}{n-1} \sum_{i = 1}^n (x_i - \bar{x})^2 \]
This will also be a different value every time we repeat the experiment
Suppose we measure the change in FOXP3 expression (\(\Delta \Delta C_t\)) between two cell types in \(n = 4\) donors, and test \(H_0: \mu = 0\) vs \(H_A: \mu \neq 0\), where \(\mu\) is the average difference in FOXP3 expression in the entire population.
Now we can get the sample mean:
\[ \begin{aligned} \bar{x} &= \frac{1}{n}\sum_{i = 1}^n x_i \\ &= \frac{1}{4}(2.1 + 2.8 + 2.5 + 2.6) \\ &= 2.5 \end{aligned} \]
This is our estimate of the true mean difference in expression (\(\mu\))
And the sample variance:
\[ \begin{aligned} s^2 &= \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2\\ & = \frac{1}{3}\sum_{i = 1}^4 (x_i - 2.5)^2\\ &= 0.0867 \end{aligned} \]
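As a quick check of the arithmetic, here is a minimal sketch of the same two calculations in Python (numpy is assumed to be available; the values are the four \(\Delta \Delta C_t\) measurements from the example):

```python
import numpy as np

# The four ddCt values from the worked example (n = 4 donors)
x = np.array([2.1, 2.8, 2.5, 2.6])

x_bar = x.mean()      # sample mean: estimates mu
s2 = x.var(ddof=1)    # sample variance: estimates sigma^2 (n - 1 denominator)

print(x_bar)  # 2.5
print(s2)     # 0.08666...
```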
\[ \mathbf{\bar{x}} = \{\bar{x}_1, \bar{x}_2, \dots, \bar{x}_m \} \]
This represents a theoretical set of repeated experiments with a different sample mean for each.
We usually just have one experiment (\(\bar{x}\)).
These sample means are Normally Distributed (at least approximately, by the Central Limit Theorem): \[ \mathbf{\bar{x}} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
where: \(\mu\) is the population mean, \(\sigma\) is the population standard deviation and \(n\) is the sample size.
We know what our experimental results (\(\bar{x}\)) will look like.
\[ \bar{x} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
If we subtract the population mean:
\[ \bar{x} - \mu \sim \mathcal{N}(0, \frac{\sigma}{\sqrt{n}}) \]
NB: We almost always test for no effect \(H_0: \mu = 0\)
Dividing by the standard error (\(\sigma / \sqrt{n}\)) gives a \(Z\)-statistic: \[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) \]
\[ H_0: \mu = 0 \quad \text{vs} \quad H_A: \mu \neq 0 \]
If \(H_0\) is true, where would we expect \(Z\) to be?
If \(H_0\) is NOT true, where would we expect \(Z\) to be?
Would a value \(Z > 1\) be likely if \(H_0\) is TRUE?
Would a value \(Z > 2\) be likely if \(H_0\) is TRUE?
In our qPCR experiment, could the \(\Delta \Delta C_t\) values be either side of zero?
\[P(|Z| > 2) = P(Z > 2) + P(Z < -2)\]
This is the most common way of deciding whether to reject \(H_0\)!
\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]
A \(p\) value is the probability of observing data as extreme as, or more extreme than, what we have observed, if \(H_0\) is true.
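As a sketch of how such a two-sided probability is calculated in practice (assuming scipy is available), for an observed \(Z = 2\):

```python
from scipy.stats import norm

# P(|Z| > 2) under H0, i.e. the two-sided p-value for an observed Z = 2
p = norm.sf(2) + norm.cdf(-2)  # P(Z > 2) + P(Z < -2)
# By symmetry this is the same as 2 * norm.sf(2)
print(p)  # ~0.0455, just under the conventional 0.05 cut-off
```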
To summarise: if \(H_0\) is true then \(Z \sim \mathcal{N}(0, 1)\), so we calculate \(Z\) from our sample, find the probability (\(p\)) of a value at least as extreme, and reject \(H_0\) if \(p\) is small.
In reality, we will never know the population variance (\(\sigma^2\)), just like we will never know \(\mu\)
Due to the uncertainty introduced by using \(s^2\) instead of \(\sigma^2\) we can no longer compare to the \(Z \sim \mathcal{N}(0, 1)\) distribution.
Instead we use a \(T\)-statistic
\[ T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]
Then we compare to a \(t\)-distribution
The \(t\)-distribution is very similar to \(\mathcal{N}(0, 1)\), but with heavier tails to allow for the extra uncertainty.
At their simplest, \(t\)-distributions are defined by a single parameter, the degrees of freedom:
\[ \text{df} = \nu = n - 1 \]
As \(n\) increases, \(s^2 \rightarrow \sigma^2\) and therefore \(t_{\nu} \rightarrow \mathcal{N}(0, 1)\)
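This convergence is easy to see from the critical values. A minimal sketch (assuming scipy), comparing the two-sided 5% cut-offs of \(t_{\nu}\) with that of \(\mathcal{N}(0, 1)\):

```python
from scipy.stats import norm, t

# Two-sided 5% critical values shrink towards the Normal value as df grows
for nu in [3, 10, 30, 100]:
    print(nu, t.ppf(0.975, df=nu))  # 3.182, 2.228, 2.042, 1.984
print(norm.ppf(0.975))              # 1.960
```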
For our qPCR data, testing \(\mu = 0\):
\[ \begin{aligned} T &= \frac{\bar{x} - \mu}{s / \sqrt{n}} \\ &= \frac{2.5 - 0}{0.294392 / \sqrt{4}} \\ &= 16.984 \end{aligned} \]
\[p = 0.00044\]
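The same test in Python (a sketch assuming scipy), reproducing the statistic and \(p\) value above:

```python
from scipy.stats import ttest_1samp

# The four ddCt values, testing H0: mu = 0
x = [2.1, 2.8, 2.5, 2.6]
res = ttest_1samp(x, popmean=0)
print(res.statistic)  # 16.98...
print(res.pvalue)     # 0.00044...
```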
In the above we had the \(\Delta \Delta C_t\) values within each donor.
What if we just had 4 values from each cell-type from different donors?
For \(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\)
\[ T = \frac{\bar{x}_A - \bar{x}_B}{\text{SE}_{\bar{x}_A - \bar{x}_B}} \]
If \(H_0\) is true then
\[ T \sim t_{\nu} \]
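A sketch of the two-sample version (the values for cell types A and B below are made up purely for illustration):

```python
from scipy.stats import ttest_ind

a = [2.1, 2.8, 2.5, 2.6]    # hypothetical values for cell type A
b = [0.2, 0.5, -0.1, 0.4]   # hypothetical values for cell type B

# Two-sample t-test of H0: mu_A = mu_B (assumes equal variances by
# default; pass equal_var=False for Welch's version)
res = ttest_ind(a, b)
print(res.statistic, res.pvalue)
```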
When would data not be Normally Distributed?
Two useful tests for non-Normal data: the Wilcoxon Rank-Sum Test (which compares two groups using ranks) and Fisher's Exact Test (for tables of counts).
\(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\)
An example table of allele counts by location:

| Location     | A  | B  |
|--------------|----|----|
| Upper Lakes  | 12 | 12 |
| Lower Plains | 20 | 4  |
\(H_0:\) No association between allele frequencies and location
\(H_A:\) There is an association between allele frequencies and location
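One way to test this (a sketch assuming scipy) is Fisher's Exact Test on the table of counts:

```python
from scipy.stats import fisher_exact

# Allele counts:   A   B
table = [[12, 12],   # Upper Lakes
         [20,  4]]   # Lower Plains

odds_ratio, p = fisher_exact(table)
print(p)  # two-sided p-value for H0: no association
```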
A \(p\) value is the probability of observing data as (or more) extreme if \(H_0\) is true.
We commonly reject \(H_0\) if \(p < 0.05\)
How often would we incorrectly reject \(H_0\)?
About 1 in 20 times, we will see \(p < 0.05\) if \(H_0\) is true
|                      | \(H_0\) TRUE  | \(H_0\) FALSE  |
|----------------------|---------------|----------------|
| Reject \(H_0\)       | Type I Error  | \(\checkmark\) |
| Don't Reject \(H_0\) | \(\checkmark\)| Type II Error  |
What are the consequences of each type of error?
Type I: Waste $$$ chasing dead ends
Type II: We miss a key discovery
How many times would we incorrectly reject \(H_0\) using \(p < 0.05\)?
Suppose we run 25,000 tests (e.g. one for every gene), where \(H_0\) is false for only 1,000 of them.
We effectively have 25,000 tests, with 24,000 times \(H_0\) is true.
\(\frac{25000 - 1000}{20} = 1200\) times
Could this lead to any research dead-ends?
This is an example of the Family-Wise Error Rate (i.e. Experiment-Wise Error Rate)
The Family-Wise Error Rate (FWER) is the probability of making one or more false rejections of \(H_0\)
In our example, the FWER \(\approx 1\)
What about if we lowered the rejection value to \(\alpha = 0.001\)?
We would incorrectly reject \(H_0\) once in every 1,000 tests where it is true
\(\frac{25000 - 1000}{1000} = 24\) times
The FWER is still \(\approx 1\)
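A minimal sketch of where this comes from, assuming the 24,000 true-null tests are independent, so that \(\text{FWER} = 1 - (1 - \alpha)^m\):

```python
# FWER for m independent tests where H0 is true
m = 24000
for alpha in [0.05, 0.001]:
    fwer = 1 - (1 - alpha) ** m
    print(alpha, fwer)  # ~1.0 for both thresholds
```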
What are the consequences of this? We would make far fewer Type I errors, but many more Type II errors: we would miss key discoveries.
The most common procedure for adjusting \(p\) values is the Benjamini-Hochberg (BH) procedure, which controls the False Discovery Rate (FDR), i.e. the expected proportion of rejected \(H_0\) that are false rejections, rather than the FWER.
What advantage would this offer? We accept a small, controlled proportion of false discoveries in exchange for far fewer Type II errors.
For those interested, the BH procedure for \(m\) tests is (not examinable):
1. Sort the \(p\) values in ascending order: \(p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}\)
2. Find the largest \(k\) such that \(p_{(k)} \leq \frac{k}{m}\alpha\)
3. Reject \(H_0\) for the tests with the \(k\) smallest \(p\) values
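A minimal sketch of the procedure in Python (numpy assumed; in practice a library routine such as statsmodels' `multipletests(..., method="fdr_bh")` would normally be used):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array: True where H0 is rejected at FDR level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                  # indices that sort the p-values
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()     # largest k with p_(k) <= (k/m) * alpha
        reject[order[:k + 1]] = True       # reject the k smallest p-values
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
# [ True  True False False False]
```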