Jimmy Breen & Steve Pederson
While we seem to have called a lot of variants on chrI, have a look at the sequence that the chromosome starts off with:
head chrI.fa
What can you say about the sequence?
Just because a variant is called, does not mean that it is a true positive! Each variant called within the file holds a variant quality score (found in the QUAL field). From the VCF format specifications:
QUAL phred-scaled quality score for the assertion made in ALT. i.e. give -10log_10 prob(call in ALT is wrong). If ALT is ”.” (no variant) then this is -10log_10 p(variant), and if ALT is not ”.” this is -10log_10 p(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. (Numeric)
To weed out the low confidence calls in our VCF file we need to filter by QUAL. This can be done using the bcftools
program that’s included within the samtools
suite of tools.
All these tools can run on gzip-compressed files which saves a lot of space on your computer.
gzip SRR2003569_chI_1Mb.vcf
Ok lets filter by QUAL. We can do this with the bcftools filter
or bcftools view
commands which allows you to run an expression filter. This means you can either exclude (-e
) or include (-i
) variants based on a certain criteria. In our case, lets exclude all variants that have a QUAL < 30.
bcftools filter -e 'QUAL < 30' SRR2003569_chI_1Mb.vcf.gz -Oz -o SRR2003569_chI_1Mb.sorted.q30.vcf.gz
# You can pipe to grep and wc to remove the header
# and count your remaining variants after filtering too
bcftools filter -e 'QUAL < 30' SRR2003569_chI_1Mb.sorted.vcf.gz | grep -v "^#" | wc -l
How many variants greater than QUAL 30 do you have? How about the number of heterozygous variants that have a QUAL>30?
bcftools filter -i 'QUAL>30 && GT="0/1"' SRR2003569_chI_1Mb.sorted.vcf.gz
The bcftools view
commands gives a lot of additional filtering options.
Use the bcftools view
or bcftools filter
command to count the number of:
a. SNPs
b. homozygous variants
Depth is also a common filtering characteristic that many people use to remove low confidence variants. If you have low coverage of a variant, it lowers your ability to accurately call a heterozygotic site (especially if you are confident that you sequenced the sample the an adequate depth!). Find the number of SNPs that have a depth that is equal to or greater than 30 and a quality that is greater than 30.
A common tool used for viewing alignments is IGV browser. We haven’t installed this on your VM as it really needs to run a local machine (i.e. your workstation or laptop). This can be installed by following the instructions here.
Additionally, we’ll need to copy our bam files from the VM to our local machine.
To do this we’ll need FileZilla.
Once you have this installed, connect to your VM using your IP address (without the :8787), and use your biotech7005
login.
Then you’ll be able to copy files easily between your local machine and the VM.
You’ll need to download 1) your sorted bam file; 2) the index for the sorted bam file and 3) your reference chrI.fa
.
Place these in all in the same folder & open IGV.
Once you’ve opened IGV, go to the Genomes
menu & select Load genome from file
.
Navigate to where you have chrI.fa
and load this file.
Although this isn’t the full genome, it will have everything we’ve aligned.
Now go to the File
menu and select Load from File
and navigate to your alignments.
Unfortunately you won’t see anything until you zoom in.
This is so IGV doesn’t hold the entire set of alignments in memory which would slow your computer to a stand-still.
Keep zooming in until some alignments appear then have a look around.