As with previous week’s practicals, you will be using RStudio to interact with your VM.
See week 1’s practical to remind yourself how to connect to your VM.
First we will set up a directory for today’s practical. In general it is very worthwhile to keep all your project-specific code and data organised into a consistent location and structure. This are not essential, but is very useful and is in general good practice. If you don’t follow this step, you will be making your life immeasurably harder for the duration of this practical.
To make and enter the directory that you will be working in, run the following commands in the terminal pane.
# Setup project working directory
mkdir --parents ~/Project_3/data/
cd ~/Project_3/
# load the required software environment
conda activate assembly
# Make the directory for the reference genome
mkdir --parents ~/Project_3/data/reference/
# Get the SARS-CoV-2 reference genome
cp ~/data/genomics/COVID-19.fasta.gz ~/Project_3/data/reference/
# Make the directory for the Illumina PE reads
mkdir --parents ~/Project_3/data/illumina_pe/
# Get the subsampled Illumina PE data
cp ~/data/genomics/SRR111407{44,46,48,50}_?_*x.fastq.gz ~/Project_3/data/illumina_pe/
# Make the directory for scripts
mkdir --parents ~/Project_3/data/scripts/
# Get the scripts
cp ~/data/genomics/plot_delta.R ~/Project_3/data/scripts/
Lets see what files and directories we have under our current working directory:
tree
{44,46,48,50}
mean/do in the above cp
command while grabing the Illumina PE data?We’re going to be using some very topical data, SARS-CoV-2 data, the virus causing COVID-19. We will be looking at this data in different ways, providing insights into how bioinformaticians analyses these types of data.
The most recent SARS-CoV-2 genome assembly is NC_045512.2.
This assembly was generated from a sample collected in Dec 2019 and submitted to NCBI on 13th Jan 2020 as NC_045512.1. Subsequently, NC_045512.1 was replaced by NC_045512.2 on the 17th Jan 2020.
Public sequence data is being released via NCBI’s SARS-CoV-2 page.
We will be looking at some data released on 21st Feb 2020 for a clinical swab obtained from a confirmed case in Madison, WI, USA. For further information see: https://openresearch.labkey.com/wiki/ZEST/Ncov/page.view?name=SARS-CoV-2%20Deep%20Sequencing
We will be working with the SRR11140748
sample (Illumina data) but you are welcome to also look at any of the other 3 samples if you have time.
Here is a table of information linking to the orginal source of the data:
Description | Accession/URL | Coverage |
---|---|---|
RefSeq | NC_045512.2 | |
Illumina | ||
veroSTAT-1KO | SRR11140744 | 7,586 |
veroE6 | SRR11140746 | 5,328 |
vero76 | SRR11140748 | 6,353 |
swab | SRR11140750 | 252 |
Lets perform some de novo genome assemblies of Illumina reads.
I encourage you to explore the effects of using differing amounts of coverage of the input data on the assembly and explore the assembly of the other 3 samples.
We have made data files available for 10x
, 20x
, 40x
, 80x
and 100x
coverage of all 4 Illumina samples.
time spades.py \
--threads 2 \
-o ~/Project_3/de_novo_illumina/SRR11140748_10x_PE \
-1 ~/Project_3/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
-2 ~/Project_3/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_3/de_novo_illumina/SRR11140748_10x_PE.log
tee
in the above pipeline; what did it do?Compare the assembly to the SARS-CoV-2 RefSeq assembly using MUMmer’s nucmer
tool:
# Decompress the reference genome for MUMmer
pigz -dcp2 \
< ~/Project_3/data/reference/COVID-19.fasta.gz \
> ~/Project_3/data/reference/COVID-19.fasta
# Run nucmer from MUMmer package
nucmer \
-maxmatch \
-minmatch 100 \
-mincluster 500 \
-prefix ~/Project_3/de_novo_illumina/SRR11140748_10x_PE \
~/Project_3/data/reference/COVID-19.fasta \
~/Project_3/de_novo_illumina/SRR11140748_10x_PE/contigs.fasta
This will generate a .delta
file which describes the alignments between your assembled contig(s) and the reference sequence(s).
While the file is plain text, it’s not meant for human consumption.
Instead, we need to use some other tools to visualise the information it contains.
The online tool Assemblytics can be used to generate useful plots of the information contained within a .delta
file.
Alternatively, your can use some R code to plot the information. We have provided an R script for this purpose, which can be executed as follows:
Rscript --vanilla \
~/Project_3/data/scripts/plot_delta.R \
~/Project_3/de_novo_illumina/SRR11140748_10x_PE.delta
This will create a PDF file called ~/Project_3/de_novo_illumina/SRR11140748_10x_PE.delta.pdf
SPAdes uses several k-mer lengths during the assembly process to produce a de Bruijn graph for each k-mer length and then combines the information to produce a final assembly. Using what you learned above about exploring assemblies, lets look at how an assembly is affected by our choice of k-mer length.
First, lets look at an assembly produced with a single short k-mer of length of 11:
k=11
time spades.py \
--threads 2 \
-k ${k} \
-o ~/Project_3/de_novo_illumina/SRR11140748_10x_PE-k${k} \
-1 ~/Project_3/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
-2 ~/Project_3/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_3/de_novo_illumina/SRR11140748_10x_PE-k${k}.log
Next, lets look at an assembly produced with a long k-mer of length of 127:
k=127
time spades.py \
--threads 2 \
-k ${k} \
-o ~/Project_3/de_novo_illumina/SRR11140748_10x_PE-k${k} \
-1 ~/Project_3/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
-2 ~/Project_3/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_3/de_novo_illumina/SRR11140748_10x_PE-k${k}.log
Now modify the above commands to produce another 2 assemblies but using the 100x
coverage data.