Now that we’ve run AdapterRemoval on our complete set of FASTQ files, we’ll need to check what impact it’s had on the quality of our sequencing library files.
Add the following line the the removeAdapters.sh file and rerun it. Alternatively, run this fastqc command directly in the terminal.
# Create the FastQC output dir ahead of time
mkdir ~/Day_3/1_trimmedData/FastQC
# Run FastQC
fastqc \
--threads 2 \
--outdir ~/Day_3/1_trimmedData/FastQC/ \
~/Day_3/1_trimmedData/fastq/*.fastq.gz
Once this completes, use ngsReports to aggregate all the FastQC reports for the trimmed FASTQ files:
library(ngsReports)
writeHtmlReport("~/Day_3/1_trimmedData/FastQC/", species="Mmusculus")
Compare this aggregated report for the trimmed data with that generated for the raw data, are you happy with the improvements to the data?
Once we have cleaned our data of any contaminating adapter sequences and removed low quality bases which might affect our downstream analyses, we can more confidently move onto aligning the reads to a reference.
For the purposes of this workshop we will perform an alignment of our RNA-seq reads to the genome. This requires using a tool capable of performing “spliced alignment”, the ability to align the two ends of a read across an intron.
Modern read alignment software tools require the reference genome to be “indexed”. Much like a telephone book index, a reference genome index facilitates the rapid lookup of information from the reference genome. More specifically, we can rapidly identify all the locations in the genome which have a perfect match to a short number of bases (a k-mer of length k base-pairs) extracted from a read. These positions identified in the genome are used to “seed” a more computationally intensive step of looking for an complete alignment of the whole read to the regions where the seed(s) were identified.
Some key differences between aligners is in the way they index the genome, and in the a way they are equipped to handle mismatches and indels (insertions and deletions). Choosing an aligner can be a difficult decision with the differences often seeming quite subtle, but with significant impacts. Sometimes there is a best choice, other times there really isn’t. Make sure you’ve researched relatively thoroughly before deciding which to use.
Here is a selection of commonly used aligners.
| aligner | query | lengths |
|---|---|---|
| bwa | DNA/RNA | short read |
| minimap2 | DNA/RNA | long read |
| STAR | RNA | short read |
| kallisto | RNA | short read |
| HISAT2 | RNA/DNA | short read |
To align any reads, we first need to download the appropriate (i.e. latest) genome and then we can build the index to enable fast searching via the Burrows-Wheeler Transform. Like we’ve seen in the previous sections, our reads today come from mice (Mus musculus).
Let’s download the reference, but first we’ll need to create a directory to hold it. As this may be useful in multiple experiments, it makes sense to place it in a directory outside our current project.
mkdir --parents ~/Day_3/genomes/Mmusculus
cd ~/Day_3/genomes/Mmusculus
We have only provided you with reads which align to chromosome 1. This is so that compute processing times are fast enough to be completed in a reasonable amount of time. Normally, you would align to the whole genome.
Use curl to download the M. musculus chromosome 1 file via the terminal:
curl "ftp://ftp.ensembl.org/pub/release-96/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz" \
--output ~/Day_3/genomes/Mmusculus/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
The curl command above copies a file from an ftp server. You need to specify exactly where you want the file to go and what to call it with the --output argument.
# Have a look at the first few lines
zcat ~/Day_3/genomes/Mmusculus/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz | head
Note that the first line describes the following sequence and begins with a > symbol and the rest contains all the sequence information. This assembly has an awful lot of unknown nucleotides, represented with NNNNN at the beginning of the chromosome, which is fairly common as these regions are usually quite difficult to sequence properly due to very long stretches of duplicated nucleotides. The N values act as placeholders for later assemblies to fill in.
If you would like to see where the sequenced part of Chromosome 1 begins in this assembly, you can use zgrep to search for any line beginning with A,C,G or T.
zgrep -n1 "^[ACTG]" ~/Day_3/genomes/Mmusculus/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz | head
The -n1 argument tells zgrep that we also want to see the line before the one on which the match is found. We are also piping the results to head so we only see 10 lines of interest from this area.
As mentioned above read aligners achieve their performance by constructing an index of the reference genome so that initial seed locations can be rapidly found for alignment extension. The process of constructing an index can take a significant amount of time, although only needs to be performed once for each genome being used and so becomes less of a burden if you’re reusing the same genome over a period of time.
Unfortunately, when building an index, STAR (which we’ll be using today) needs an extracted fasta file, so we’ll have to extract this file now
gunzip Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
Index the decompressed chromosome 1 FASTA file:
STAR \
--runMode genomeGenerate \
--runThreadN 2 \
--genomeDir ~/Day_3/genomes/Mmusculus \
--genomeSAindexNbases 12 \
--genomeFastaFiles \
~/Day_3/genomes/Mmusculus/Mus_musculus.GRCm38.dna.chromosome.1.fa
With this complete, you are now ready to align reads to the indexed genome.
In the event that the indexing fails or takes too long, copy our precomputed index files:
cp --recursive \
~/data/Day_3/genomes/Mmusculus \
~/Day_3/genomes/