Day 3

Aligning the reads

Because we only have a small subset of the actual sequencing run, we should be able to run this alignment in a reasonable period of time. First we’ll return to our project folder

cd ~/Day_3

Here’s the basic commands required for aligning one of our samples.

# Create the output directories ahead of time
mkdir --parents ~/Day_3/2_alignedData/bam
mkdir --parents ~/Day_3/2_alignedData/log

# Perform the read alignment for 1 sample
STAR \
  --runThreadN 2 \
  --genomeDir ~/Day_3/genomes/Mmusculus \
  --readFilesIn ~/Day_3/1_trimmedData/fastq/SRR945375.skm.fastq.gz \
  --readFilesCommand gunzip -c \
  --outFileNamePrefix ~/Day_3/2_alignedData/bam/SRR945375.skm. \
  --outSAMtype BAM SortedByCoordinate

This should complete within a couple of minutes, so while it’s running make sure you understand all of the commands given.

Why have we specified --readFilesCommand gunzip -c
This tells STAR to read files in using this command, and enables us to leave our fastq files compressed, saving hard drive (i.e. storage) space.
What to do you think the line --outSAMtype BAM SortedByCoordinate is doing?
This is making sure we return a BAM file, instead of a SAM file, sorted by genomic position.

The resulting BAM file will have the prefix we specified on the command line but will have the suffix Aligned.sortedByCoord.out.bam added.

SAM and BAM files

Now that we have aligned the reads of a single sample, let’s look in the directory ~/Day_3/2_alignedData/bam.

ll ~/Day_3/2_alignedData/bam

Here we’ll see that BAM file we mentioned earlier and a few log files which STAR always produces. These can actually be quite handy, but we won’t really look at them much today. However, let’s move them into a log folder and have a quick squiz.

# Move the "out" files
mv \
  ~/Day_3/2_alignedData/bam/*out \
  ~/Day_3/2_alignedData/log/

# move the "tab" files
mv \
  ~/Day_3/2_alignedData/bam/*tab \
  ~/Day_3/2_alignedData/log/

# Take a look at the "final.out" file
less ~/Day_3/2_alignedData/log/SRR945375.skm.Log.final.out

As you can see, this is a relatively detailed summary of our alignments for that file. For now, let’s return to our actual alignments, as contained in the BAM file.

What is a BAM file?

BAM stands for Binary AlignMent and these are our alignments stored in binary. There is another type of file called a SAM (Sequence Alignment/Map) file which is in plain text, but SAM files can become very large and waste our precious storage resources. BAM files are the exact same files stored in binary format which enables them to be much smaller. This also has the added benefit of being faster for computers to read & perform operations on, as only humans read plain text. Computers don’t.

To look at the contents of a BAM file, we’ll need the tool samtools which is one of the most heavily utilised command-line tools in the world of bioinformatics. This tool has a series of commands which we can see if we just type samtools into our terminal. The one we’ll need at this point is samtools view, which enables us to take a BAM file and send it to STDOUT as plain text so we can read it.

samtools view ~/Day_3/2_alignedData/bam/SRR945375.skm.Aligned.sortedByCoord.out.bam \
  | head --lines 2

This will return our first two alignments in a tab-delimited format. Although the lines are many columns wide and will probably be wrapped around in your terminal, you should be able to immediately spot the sequence identifier lines which came from our original fastq files. This is sometimes referred to as the query name (or QNAME). The remainder of the fields are given below.

Field Name Meaning
1 QNAME Query template/pair NAME
2 FLAG bitwise FLAG (discussed later)
3 RNAME Reference sequence (i.e. chromosome) NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string
7 MRNM Mate Reference sequence NaMe (= if same as RNAME)
8 MPOS 1-based Mate POSition
9 TLEN inferred Template LENgth (insert size)
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12 OPT variable OPTional fields in the format TAG:VTYPE:VALUE

The internal representation of BAM files does not (for the most part) include tabs, but the information stored there is identical.

Several of these fields contain useful information, so looking the the first few lines which we displayed above, you can see that these reads are mapped in pairs as consecutive entries in the QNAME field are often (but not always) identical. Most of these fields are self-explanatory, but some require exploration in more detail.

SAM Flags (Field 2)

These are quite useful pieces of information, but can be difficult at first look. Head to http://broadinstitute.github.io/picard/explain-flags.html to see a helpful description. The simplest way to understand these is that it is a bitwise system so that each description heading down the page increases in a binary fashion. The first has value 1, the second has value 2, the third has value 4 and so on until you reach the final value of 2048. The integer value contained in this file is the unique sum of whichever attributes the mapping has. For example, if the read is paired and mapped in a proper pair, but no other attributes are set, the flag field would contain the value 3.

Questions

{:.no_toc}

  1. What value could a flag take if the read was 1 - paired; 2 - mapped in a proper pair; 3 - it was the first in the pair, and 4 - the alignment was a supplementary alignment.
  2. Some common values in the bam file are 16, 256 and 272. Look up the meanings of these values.

Things can easily begin to confuse people once you start searching for specific flags, but if you remember that each attribute is like an individual flag that is either on or off (i.e. it is actually a binary bit with values 0 or 1). If you searched for flags with the value 1, you wouldn’t obtain the alignments with the exact value 1, rather you would obtain the alignments for which the first flag is set and these can take a range of values.

A summary of some of the flag can be obtained using the command samtools flagstat

samtools flagstat ~/Day_3/2_alignedData/bam/SRR945375.skm.Aligned.sortedByCoord.out.bam

Questions

  1. How many reads aligned to our genome?

  2. What information does samtools stats provide that samtools flagstat does not?

CIGAR strings

These give useful information about the type of alignment that has been performed on the read. In the first few reads we called up earlier, most had the value ..M where .. is some number. These are the aligned positions. CIGAR strings from RNA-seq read alignments contain large numbers followed by N. This represents a spliced alignment, and in this context is very likely to be where part of a read aligns to one exon, whilst the other part of the read aligns to another exon, making this commonly indicative of a spliced alignment.

The other abbreviations in common use are I (insertion), D (deletion) and S (soft-clipping).

What is the interpretation of the first CIGAR string in your set of alignments?

Aligning All Samples

So far, we have only aligned a single read file. We will now write a script for aligning ALL of our read files. Let’s use a similar strategy as before remembering that the key values we need to find for each sample are:

  • The input file (${f})
  • The output prefix (maybe call this PREFIX in your script)

This is actually a bit simpler than for AdapterRemoval, but remember the strategy of building our script up one line at a time is a good approach to take. This allows us to check what we are doing as we progress before implementing our final call to STAR to do the read alignment.

See if you can figure out how to move the log files to the log directory as part of your script too!

Bonus Material

The package ngsReports is also able to import all of your STAR log files, in particular those with the suffix Log.final.out

Try the following command in your RStudio session.

library(ngsReports)
starlogs <- list.files("2_alignedData/log/", pattern = "Log.final.out", full.names = TRUE)
alnSummary <- importNgsLogs(starlogs, type = "star")
alnSummary

Note how all the information contained in the log files we inspected earlier is now in your R session as a convenient data.frame (or tibble).

Now you can easily make a summary of your alignment statistics for your thesis!

library(tidyverse)
library(pander)

alnSummary %>%
    mutate(Filename = str_remove(Filename, ".Log.final.out")) %>%
    dplyr::select(
        Filename, 
        Total_Mapped_Percent, 
        Uniquely_Mapped_Reads_Percent, 
        Percent_Of_Reads_Mapped_To_Multiple_Loci,
        Percent_Of_Reads_Mapped_To_Too_Many_Loci
    ) %>%
    pander(split.tables =Inf, caption = "Alignment Summary")
Alignment Summary
Filename Total_Mapped_Percent Uniquely_Mapped_Reads_Percent Percent_Of_Reads_Mapped_To_Multiple_Loci Percent_Of_Reads_Mapped_To_Too_Many_Loci
SRR945375.skm 99.96 98.85 1.1 0.01
SRR945375.sk 99.96 98.85 1.1 0.01
SRR945377.sk 99.96 99.13 0.83 0.01
SRR945379.sk 99.95 98.72 1.23 0.02
SRR945381.sk 99.96 99.15 0.81 0.01
SRR945419.cb 99.92 97.95 1.97 0.04
SRR945421.cb 99.93 98.42 1.51 0.03
SRR945423.cb 99.93 98.2 1.73 0.03
SRR945427.cb 99.94 98.5 1.45 0.02

These alignment files are now ready for use in creating counts files as you learned on the day 2 course.