Working with Alignments

Quality control of alignment data

A fundamental aspect of high-throughput sequencing approaches is the alignment of your input sample to a closely-related reference genome. By identifying differences, or genetic variations, between the assembled reference genome and the input sample, we are able to make inferences regarding potential genetic changes that are creating phenotypic differences. However, aligning millions, and often billions, of sequence fragments (referred to as “reads”) to a specific reference genome is an incredibly complex task, especially in light of repetitive sequences that make it difficult to place certainity on a specific location of each sequence read.

Of course, what happens when we sequence RNA? RNA fragment differ from DNA in that it only contains coding regions of the genome. This means we will only see the exons of a gene, with the introns being spliced out after transcription. If we are aligning to a genome made of DNA, rather than a transcriptome made of RNA, then we need the ability to align to split sequences across intron-exon boundaries.

So it is extremely important that we have a way of detailing where each read aligned, as well as information as to how well it aligned. In this session we’re going to get into how we store alignments

Data setup

Because we’re working backwards, we are assuming that we have already aligned our RNA-seq samples to a reference genome.

We’ll be working today in the folder ~/Day_2. Lets change into that directory:

cd ~/Day_2

Please ensure you use this exact path as any variations will undoubtedly cause you problems and lead to unnecessary confusion.

Now we have our base directory setup for the day, we can place our data in the directory ~/Day_2/data. As you may have seen yesterday, the data that we need is already available in the ~/data directory in our home area. Lets copy all the contents of that directory to our new

mkdir -p ~/Day_2/data
cp -r ~/data/Day_2/* ~/Day_2/data/
ls ~/Day_2/data/

There should be one directory contained there, called 2_alignedData. This contains all the data that (funnily enough) we will be creating tomorrow, but for which we need today. Within 2_alignedData we had all the inputs that we need to look at alignments and count data.

ls ~/Day_2/data/2_alignedData

Let’s get stuck into learning about alignment files.

What is a SAM file and what’s a BAM file?

SAM stands for Sequence Alignment/Map, a text file format that was developed to store sequence alignment data. Because SAM files are textual, they can become very large and waste our precious storage resources. To get around this problem a binary equivalent of SAM was also developed (BAM) where alignments stored in binary. BAM files store the same data as SAM files, but in a compressed binary format which enables them to be much smaller. This also has the added benefit of being faster for computers to read and perform operations on, as only humans read plain text. Computers don’t.

To look at the contents of a BAM file, we’ll need the tool samtools which is one of the most heavily utilised command-line tools in the world of bioinformatics. samtools is used for a number of tasks in high-throughput sequencing analysis, from viewing and converting alignment formats, to generating reports and statistics. This tool as a series of sub-commands which we can see if we just type samtools into our terminal.

Program: samtools (Tools for alignments in the SAM format)
Version: 1.9 (using htslib 1.9)

Usage:   samtools <command> [options]

Commands:
  -- Indexing
     dict           create a sequence dictionary file
     faidx          index/extract FASTA
     fqidx          index/extract FASTQ
     index          index alignment

  -- Editing
     calmd          recalculate MD/NM tags and '=' bases
     fixmate        fix mate information
     reheader       replace BAM header
     targetcut      cut fosmid regions (for fosmid pool only)
     addreplacerg   adds or replaces RG tags
     markdup        mark duplicates

  -- File operations
     collate        shuffle and group alignments by name
     cat            concatenate BAMs
     merge          merge sorted alignments
     mpileup        multi-way pileup
     sort           sort alignment file
     split          splits a file by read group
     quickcheck     quickly check if SAM/BAM/CRAM file appears intact
     fastq          converts a BAM to a FASTQ
     fasta          converts a BAM to a FASTA

  -- Statistics
     bedcov         read depth per BED region
     depth          compute the depth
     flagstat       simple stats
     idxstats       BAM index stats
     phase          phase heterozygotes
     stats          generate stats (former bamcheck)

  -- Viewing
     flags          explain BAM flags
     tview          text alignment viewer
     view           SAM<->BAM<->CRAM conversion
     depad          convert padded BAM to unpadded BAM

As you can see, each sub-command points to specific tasks that you can do with the program. The one we’ll need at this point is samtools view, which enables us to take a BAM file and send it to the screen as plain text so we can read it. Remember, we can use the pipe | to direct the screen to a function call head, which gives the first lines of the file.

cd ~/Day_2/data/2_alignedData/bam
samtools view SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | head

As you can see, there is a tonne of information here set out in columns which are all tab separated. These contain information about the where and how the read aligned to the reference genome. The specific information of each field is contained below:

Field Name Meaning
1 QNAME Query template/pair NAME
2 FLAG bitwise FLAG (discussed later)
3 RNAME Reference sequence (i.e. chromosome) NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string
7 MRNM Mate Reference sequence NaMe (= if same as RNAME)
8 MPOS 1-based Mate POSition
9 TLEN inferred Template LENgth (insert size)
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12 OPT variable OPTional fields in the format TAG:VTYPE:VALUE

Notice that each read is considered to be a query in the above descriptions, as we a querying the genome to find out where it probably came from.

Several of these fields contain useful information, so looking the the first few lines, you can see that these reads are mapped in pairs as consecutive entries in the QNAME field are often (but not always) identical. Most of these fields are self-explanatory, but some require exploration in more detail. Note that in the following command, each line from the file may wrap around several lines in your terminal. If this is confusing, just select the first read only by adding the option -n1 after your call to head

samtools view SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | head -n1

While its not actually output in BAM files, an alignment file has a lot of extra information that is commonly printed at the start of a file. The samtools view command actually hides this from you by default, but it contains some very important information such as the names of all the reference genome sequences, whether the file is sorted or not, and the command that was used to create the file in the first place. To see this info we need to add the -h flag.

samtools view -h SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | head 

This is what you should see:

@HD     VN:1.4  SO:coordinate
@SQ     SN:1    LN:195471971
...
...
@PG     ID:STAR PN:STAR VN:STAR_2.5.3a  CL:STAR   --runThreadN 12   --genomeDir /data/biorefs/reference_genomes/ensembl-release-96/mus_musculus/star   --readFilesIn /data/biohub/2019_SIB/1_trimmedData/fastq/SRR945375.skm.trimmed.fastq.gz      --readFilesCommand gunzip   -c      --outFileNamePrefix /data/biohub/2019_SIB/2_alignedData/bam/SRR945375.skm.trimmed   --outSAMtype BAM   SortedByCoordinate
@CO     user command line: STAR --runThreadN 12 --genomeDir /data/biorefs/reference_genomes/ensembl-release-96/mus_musculus/star --readFilesIn /data/biohub/2019_SIB/1_trimmedData/fastq/SRR945375.skm.trimmed.fastq.gz --readFilesCommand gunzip -c --outFileNamePrefix /data/biohub/2019_SIB/2_alignedData/bam/SRR945375.skm.trimmed --outSAMtype BAM SortedByCoordinate

Questions

{:.no_toc}

  1. What program and version of the program was used to align the genome?
  2. How many lines does the head of the SRR945375.skm.trimmedAligned.sortedByCoord.out.bam file contain?

Filtering a BAM file

As you have seen, the information contained in the BAM file is very comprehensive and so its often preferrable to filter the original BAM file to only include the relevant information. The alignment algorithm used to create the BAM file will calculate a lot of quality values that can used for filtering, and today we’ll look at mapping quality (MAQ) and CIGAR strings.

Mapping Quality (MAPQ)

Let’s run head on one of our alignments files again, this time without the header information at the top.

samtools view SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | head 

Hopefully the 3rd and 4th fields are self explanatory, as these can be generally interpreted as the chromosome and position in the reference where the sequence aligned. However, the 5th field contains the MAPQ score which indicates how well the read aligned, and how unique each alignment is. How this value is calculated can often differ between alignment tools, but primarily the a higher score indicates a better, more unique alignment.

You will find out tomorrow that quality scores in sequencing technologies, as well as mapping qualities, are projected onto the same quality scale called a “Phred Score”. On sequencing machines, a Phred score is a probability of error for each individual base pair that was sequenced and therefore is an assigned level of accuracy. In fact, if you look in the second last field (11th) you will see the ASCII-33 Phred base quality scores. On the latest sequencing machines that produce FASTQ files (Illumina 1.8+), these scores go from 0 - 41. The table below defines the error profile of Phred Quality scores as you go higher on the scale.

Phred quality scores are logarithmically linked to error probabilities
Phred Quality Score Probability of incorrect base call Base call accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10,000 99.99%
50 1 in 100,000 99.999%
60 1 in 1,000,000 99.9999%

The higher you go the smaller the probability that you’ve sequenced a error. As you can see from that 11th field though, we dont see numbers but rather individual characters such as ’<9?30 quality values.

CIGAR strings

In the 6th field of the BAM file, we see another character and number code that gives useful information about the type of alignment that has been performed on the read. In the first few reads we called up earlier, most had the value ..M where .. is some number. These are the perfect Matches, where the sequence has aligned exactly. The other abbreviations in common use are I (insertion), D (deletion) and S (soft clipped).

SAM Flags

SAM flags are found in the second field of the BAM file and are quite useful pieces of information, however they can be difficult at first look. We won’t go into these today as a lot of these flags concern paired-end read sequencing and today we have single-end sequencing data. However the flags can indicate a lot of information that can be used to filter the BAM file, much like you did with mapping alignment.

Head to http://broadinstitute.github.io/picard/explain-flags.html to see a helpful description of each flag. For example, if we were to identify reads that mapped to the reverse strand, we can use the SAM flag 16. Then we can use the -f parameter to filter our BAM file to only include those reads.

samtools view -f 16 SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | head

You can also do the exact opposite, i.e. identify all reads that are not on the reverse strand, by using the -F parameter.

samtools view -F 16 SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | head

Going through a lot of these SAM flags one by one would be fairly tedious, so samtools has a subcommand called flagstat which counts the number of reads in specific flags.

samtools flagstat SRR945375.skm.trimmedAligned.sortedByCoord.out.bam

Questions

{:.no_toc}

  1. Using the BAM file “SRR945375.skm.trimmedAligned.sortedByCoord.out.bam”, how many reads differ between -F 16 and -f 16?
  2. How many reads were mapped to each file and how many had a MAPQ greater than 30?
  3. How many lines does the header of the SRR945375.skm.trimmedAligned.sortedByCoord.out.bam file contain? Do all the BAM file contain the same number of lines?

Assessment of alignment rate and multi-mapping

Now that we know what’s in a BAM file, how do we assess the quality of the alignment process. Generally, this is by looking at two metrics:

  • how many reads have aligned?
  • how many reads multi-mapped?

If you noticed when you were viewing the BAM file previously, there were a few alignments at the top of the file which had the same read name in field 1. Lets have a look again by running head and samtools

samtools view SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | head 

In this particular BAM file, the query read SRR945375.4599518 in the first column is found multiple times. Lets see if thats just a one-off, or whether its a regular occurance in the file

samtools view -F 512 SRR945375.skm.trimmedAligned.sortedByCoord.out.bam | grep "SRR945375.4599518"

The primary goal of genome sequence alignment is where you identify the exact position of each read on the reference genome, however it is often the case that a read can map to multiple locations, termed “multi-mapped reads”. In the SAM specifications you can find the full definition of multi-mapping:

Note If you’ve ever found yourself in need of an exciting read, feel free to look through the full specifications of a SAM/BAM file (https://samtools.github.io/hts-specs/SAMv1.pdf). I suggest a good bottle of red wine.

Multiple mapping: The correct placement of a read may be ambiguous, e.g., due to repeats. In this case,
there may be multiple read alignments for the same read. One of these alignments is considered
primary. All the other alignments have the secondary alignment flag set in the SAM records that
represent them. All the SAM records have the same QNAME and the same values for 0x40 and 0x80
flags. Typically the alignment designated primary is the best alignment, but the decision may be
arbitrary.

So each read has a primary alignment (i.e. region which the read aligned to which is usually the best), and then any subsequent alignment is designated the secondary alignment. From our multi-mapped “SRR945375.4599518” reads, we have:

SRR945375.4599518       272     1       3621352 0       15M3S   *       0       0       TGGCATCTACGGTTCAAA      @CA4A2<+AABB=A=111      NH:i:10 HI:i:5   AS:i:14 nM:i:0
SRR945375.4599518       272     1       3621352 0       15M3S   *       0       0       TGGCATCTACGGTTCAAA      @CA4A2<+AABB=A=111      NH:i:10 HI:i:5   AS:i:14 nM:i:0
SRR945375.4599518       16      1       59216019        0       3S15M   *       0       0       TGGCATCTACGGTTCAAA      @CA4A2<+AABB=A=111     NH:i:10  HI:i:1  AS:i:14 nM:i:0
SRR945375.4599518       16      1       59216019        0       3S15M   *       0       0       TGGCATCTACGGTTCAAA      @CA4A2<+AABB=A=111     NH:i:10  HI:i:1  AS:i:14 nM:i:0
SRR945375.4599518       256     18      85622164        0       3S15M   *       0       0       TTTGAACCGTAGATGCCA      111=A=BBAA+<2A4AC@     NH:i:10  HI:i:8  AS:i:14 nM:i:0
...
...

Multi-mapped reads are problematic because we are not confident in their position in the genome, and therefore have a high probability of being erroneous. And this is reflective in the MAPQ score as well. As you learnt previously, your MAPQ score is the probability of the read is incorrectly mapped, or more importantly, the probability that the read maps uniquely. So if you look at the 5th column of the specific “SRR945375.4599518” reads, you see that they are all have a MAPQ value of zero. However if you look at the SAM flag field, the reads contain different flags depending on their location.

Additionally, you can check to see how many mappings a read has using the NH: tag in the last (12th) field of the line. That field is optional but many aligners, including the one that we have used i.e. STAR, add the NH tag. From the STAR manual ():

The number of loci Nmap a read maps to is given by NH:i: field. Value of 1 corresponds to
unique mappers, while values >1 corresponds to multi-mappers. HI attrbiutes enumerates multiple
alignments of a read starting with 1.

Tasks

Using both samtools and other unix tools, identify:

  1. The unique mapped reads in each BAM file
  2. The unique mapped reads that have a MAPQ > 30 in each BAM file
  3. The number of alignments found on Chromosome 1 in each BAM file

To deduplicate or not to deduplicate? That is the question!

During the sequencing library process, DNA or cDNA fragments are amplified in a Polymerase Chain Reaction (PCR) using specific adapters and primers. If the initial unique targets are saturated during this process you can lead to a scenario where replicated fragments are amplified, leading to what we refer to as “Library Duplicates” or “PCR Duplicates”. Additionally, the Illumina machine itself can introduce another duplication effect called an “Optical Duplicate” which is a result of the position of a sequencing cluster on the sequencing flowcell. A few programs exist that can identify these duplicates and remove them from the final BAM file. To do this, these programs identify sets of reads pairs that have the same unclipped alignment start and unclipped alignment end (i.e. the reads start and end at the same alignment position).

Duplicates can be particularly problematic in variant calling where it violates the assumptions of variant identification and has the potential to over-inflate sequencing error. However, the catch is that unless you have paired-read sequences (which we do not), its very difficult for you to know that your read is a duplicate in single-end reads because you don’t sequence the other end of the fragment.

To add more caveats, count-based sequencing approaches such as ChIP-seq and RNA-seq are generally prone to having high-coverage areas (especially if you have deep sequencing) which may look like duplcates. Some small non-coding RNAs are also short, so its very likely to have similar alignment starts and ends. Additionally, if you have an experiment that involves a low initial input of DNA or RNA, you’re likely to get a high level of PCR duplicates anyway!

So do we remove duplicates for these approaches?

The answer is:

Generally for RNA sequencing or ChIP-seq experiments, we will run both raw and duplicate reads through the same pipeline to compare the results to make sure our duplicates are not effecting our final outcome.

For today however, we are going to skip calling duplicates, but you can use either use samtools or picard tools to remove them, such as the code below:

# Remove duplicates the samtools way
samtools rmdup [SORTED BAM] [SORTED RMDUP BAM]

# Remove duplicates the picard way (which uses Java)
java -jar /path/to/picard/tools/picard.jar MarkDuplicates I=[SORTED BAM] O=[SORTED RMDUP BAM] M=dups.metrics.txt REMOVE_DUPLICATES=true