As with previous week’s practicals, you will be using RStudio to interact with your VM.
See week 1’s practical to remind yourself how to connect to your VM.
First we will set up a directory for today’s practical. In general it is very worthwhile to keep all your project-specific code and data organised into a consistent location and structure. This are not essential, but is very useful and is in general good practice. If you don’t follow this step, you will be making your life immeasurably harder for the duration of this practical.
To make and enter the directory that you will be working in, run the following commands in the terminal pane.
# Setup project working directory
mkdir --parents ~/Project_4/data/
cd ~/Project_4/
# load the required software environment
conda activate assembly
# Make the directory for the reference genome
mkdir --parents ~/Project_4/data/reference/
# Make subdirectories for the various data sets
mkdir --parents ~/Project_4/data/{illumina_pe,pacbio}/
# Get the data
#####
# RefSeq E. coli K-12 substr. MG1655
cp --link ~/data/genomics/NC_000913.3.fasta.gz ~/Project_4/data/reference/
# Illumina PE
cp --link ~/data/genomics/36_ACGCACCT-GGTGAAGG_L002_R?_001_40x.fastq.gz ~/Project_4/data/illumina_pe/
# PacBio
cp --link ~/data/genomics/lima.bc1106--bc1106_40x.subreadset.fastq.gz ~/Project_4/data/pacbio/
# Nanopore
# TODO
--link
argument to cp
do? Hint: Use the man
page and google to work it out.--link
be useful with genomics data files?Using FastQC, check the Illumina data and determine if you need to perform read trimming.
fastqc \
--threads 2 \
data/illumina_pe/*.fastq.gz
If you deem it necessary to quality trim, adapter trim or length filter your raw reads then look back at last week’s code and use either Trimmomatic or fastp to perform the trimming/filtering.
Once you’re happy your reads are good to align to the reference genome
# Index the reference genome for use with BWA
bwa index data/reference/NC_000913.3.fasta.gz
# Create an output directory for read mappings
mkdir --parents mappings/
# Align reads
bwa mem \
-t 2 \
-T 30 \
data/reference/NC_000913.3.fasta.gz \
data/illumina_pe/36_ACGCACCT-GGTGAAGG_L002_R1_001_40x.fastq.gz \
data/illumina_pe/36_ACGCACCT-GGTGAAGG_L002_R2_001_40x.fastq.gz \
| samtools view -F 4 -u \
| samtools sort \
--threads 2 -l 7 \
-o mappings/36_ACGCACCT-GGTGAAGG_L002_40x_T30.bam \
/dev/stdin
-T
argument to bwa mem
do and what default value does bwa mem
use?-F 4
argument to samtools view
do?-u
argument to samtools view
do and why is this beneficial in the middle of a pipeline?-l 7
argument to samtools sort
do?# Index the reference genome for minimap2
minimap2 \
-d data/reference/NC_000913.3.fasta.gz.mmi \
data/reference/NC_000913.3.fasta.gz
# Align the reads
time minimap2 \
-ax map-pb \
-t 2 \
data/reference/NC_000913.3.fasta.gz \
data/pacbio/lima.bc1106--bc1106_40x.subreadset.fastq.gz \
| samtools view -F 4 -u \
| samtools sort \
--threads 2 -l 7 \
-o mappings/lima.bc1106--bc1106_40x.bam
Today, we will try to download and run IGV on your local computer. Head to the IGV download page and grab the version for your operating system. If you don’t have Java or you have troubles running it, try using IGV-web instead.
NOTE: If you’re using IGV-web, you will need to decompress the genome sequence FASTA file.
Also, don’t forget to index the following files:
Once you have done this, load the E. coli K-12 genome and then load the BAM files for both the Illumina and PacBio data.
What do you make of these regions:
NC_000913.3:276291-294244
NC_000913.3:4295777-4296810
In the above bwa mem
alignment step, we used the default value (30
) for -T
.
I asked you what the option does.
Now I want you to explore the effect of increasing this value on the number of mismatches observed in the read alignments when you load the resulting BAM files into IGV.