Because we only have a small subset of the actual sequencing run, we should be able to run this alignment in a reasonable period of time. First we’ll return to our project folder
cd ~/Day_3
Here’s the basic commands required for aligning one of our samples.
# Create the output directories ahead of time
mkdir --parents ~/Day_3/2_alignedData/bam
mkdir --parents ~/Day_3/2_alignedData/log
# Perform the read alignment for 1 sample
STAR \
--runThreadN 2 \
--genomeDir ~/Day_3/genomes/Mmusculus \
--readFilesIn ~/Day_3/1_trimmedData/fastq/SRR945375.skm.fastq.gz \
--readFilesCommand gunzip -c \
--outFileNamePrefix ~/Day_3/2_alignedData/bam/SRR945375.skm. \
--outSAMtype BAM SortedByCoordinate
This should complete within a couple of minutes, so while it’s running make sure you understand all of the commands given.
Why have we specified--readFilesCommand gunzip -c
--outSAMtype BAM SortedByCoordinate is doing?
The resulting BAM file will have the prefix we specified on the command line but will have the suffix Aligned.sortedByCoord.out.bam added.
Now that we have aligned the reads of a single sample, let’s look in the directory ~/Day_3/2_alignedData/bam.
ll ~/Day_3/2_alignedData/bam
Here we’ll see that BAM file we mentioned earlier and a few log files which STAR always produces. These can actually be quite handy, but we won’t really look at them much today. However, let’s move them into a log folder and have a quick squiz.
# Move the "out" files
mv \
~/Day_3/2_alignedData/bam/*out \
~/Day_3/2_alignedData/log/
# move the "tab" files
mv \
~/Day_3/2_alignedData/bam/*tab \
~/Day_3/2_alignedData/log/
# Take a look at the "final.out" file
less ~/Day_3/2_alignedData/log/SRR945375.skm.Log.final.out
As you can see, this is a relatively detailed summary of our alignments for that file. For now, let’s return to our actual alignments, as contained in the BAM file.
BAM stands for Binary AlignMent and these are our alignments stored in binary. There is another type of file called a SAM (Sequence Alignment/Map) file which is in plain text, but SAM files can become very large and waste our precious storage resources. BAM files are the exact same files stored in binary format which enables them to be much smaller. This also has the added benefit of being faster for computers to read & perform operations on, as only humans read plain text. Computers don’t.
To look at the contents of a BAM file, we’ll need the tool samtools which is one of the most heavily utilised command-line tools in the world of bioinformatics. This tool has a series of commands which we can see if we just type samtools into our terminal. The one we’ll need at this point is samtools view, which enables us to take a BAM file and send it to STDOUT as plain text so we can read it.
samtools view ~/Day_3/2_alignedData/bam/SRR945375.skm.Aligned.sortedByCoord.out.bam \
| head --lines 2
This will return our first two alignments in a tab-delimited format. Although the lines are many columns wide and will probably be wrapped around in your terminal, you should be able to immediately spot the sequence identifier lines which came from our original fastq files. This is sometimes referred to as the query name (or QNAME). The remainder of the fields are given below.
| Field | Name | Meaning |
|---|---|---|
| 1 | QNAME | Query template/pair NAME |
| 2 | FLAG | bitwise FLAG (discussed later) |
| 3 | RNAME | Reference sequence (i.e. chromosome) NAME |
| 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
| 5 | MAPQ | MAPping Quality (Phred-scaled) |
| 6 | CIGAR | extended CIGAR string |
| 7 | MRNM | Mate Reference sequence NaMe (= if same as RNAME) |
| 8 | MPOS | 1-based Mate POSition |
| 9 | TLEN | inferred Template LENgth (insert size) |
| 10 | SEQ | query SEQuence on the same strand as the reference |
| 11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
| 12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
The internal representation of BAM files does not (for the most part) include tabs, but the information stored there is identical.
Several of these fields contain useful information, so looking the the first few lines which we displayed above, you can see that these reads are mapped in pairs as consecutive entries in the QNAME field are often (but not always) identical. Most of these fields are self-explanatory, but some require exploration in more detail.
These are quite useful pieces of information, but can be difficult at first look. Head to http://broadinstitute.github.io/picard/explain-flags.html to see a helpful description. The simplest way to understand these is that it is a bitwise system so that each description heading down the page increases in a binary fashion. The first has value 1, the second has value 2, the third has value 4 and so on until you reach the final value of 2048. The integer value contained in this file is the unique sum of whichever attributes the mapping has. For example, if the read is paired and mapped in a proper pair, but no other attributes are set, the flag field would contain the value 3.
{:.no_toc}
Things can easily begin to confuse people once you start searching for specific flags, but if you remember that each attribute is like an individual flag that is either on or off (i.e. it is actually a binary bit with values 0 or 1). If you searched for flags with the value 1, you wouldn’t obtain the alignments with the exact value 1, rather you would obtain the alignments for which the first flag is set and these can take a range of values.
A summary of some of the flag can be obtained using the command samtools flagstat
samtools flagstat ~/Day_3/2_alignedData/bam/SRR945375.skm.Aligned.sortedByCoord.out.bam
How many reads aligned to our genome?
What information does samtools stats provide that samtools flagstat does not?
These give useful information about the type of alignment that has been performed on the read. In the first few reads we called up earlier, most had the value ..M where .. is some number. These are the aligned positions. CIGAR strings from RNA-seq read alignments contain large numbers followed by N. This represents a spliced alignment, and in this context is very likely to be where part of a read aligns to one exon, whilst the other part of the read aligns to another exon, making this commonly indicative of a spliced alignment.
The other abbreviations in common use are I (insertion), D (deletion) and S (soft-clipping).
What is the interpretation of the first CIGAR string in your set of alignments?
So far, we have only aligned a single read file. We will now write a script for aligning ALL of our read files. Let’s use a similar strategy as before remembering that the key values we need to find for each sample are:
${f})PREFIX in your script)This is actually a bit simpler than for AdapterRemoval, but remember the strategy of building our script up one line at a time is a good approach to take. This allows us to check what we are doing as we progress before implementing our final call to STAR to do the read alignment.
See if you can figure out how to move the log files to the log directory as part of your script too!
The package ngsReports is also able to import all of your STAR log files, in particular those with the suffix Log.final.out
Try the following command in your RStudio session.
library(ngsReports)
starlogs <- list.files("2_alignedData/log/", pattern = "Log.final.out", full.names = TRUE)
alnSummary <- importNgsLogs(starlogs, type = "star")
alnSummary
Note how all the information contained in the log files we inspected earlier is now in your R session as a convenient data.frame (or tibble).
Now you can easily make a summary of your alignment statistics for your thesis!
library(tidyverse)
library(pander)
alnSummary %>%
mutate(Filename = str_remove(Filename, ".Log.final.out")) %>%
dplyr::select(
Filename,
Total_Mapped_Percent,
Uniquely_Mapped_Reads_Percent,
Percent_Of_Reads_Mapped_To_Multiple_Loci,
Percent_Of_Reads_Mapped_To_Too_Many_Loci
) %>%
pander(split.tables =Inf, caption = "Alignment Summary")
| Filename | Total_Mapped_Percent | Uniquely_Mapped_Reads_Percent | Percent_Of_Reads_Mapped_To_Multiple_Loci | Percent_Of_Reads_Mapped_To_Too_Many_Loci |
|---|---|---|---|---|
| SRR945375.skm | 99.96 | 98.85 | 1.1 | 0.01 |
| SRR945375.sk | 99.96 | 98.85 | 1.1 | 0.01 |
| SRR945377.sk | 99.96 | 99.13 | 0.83 | 0.01 |
| SRR945379.sk | 99.95 | 98.72 | 1.23 | 0.02 |
| SRR945381.sk | 99.96 | 99.15 | 0.81 | 0.01 |
| SRR945419.cb | 99.92 | 97.95 | 1.97 | 0.04 |
| SRR945421.cb | 99.93 | 98.42 | 1.51 | 0.03 |
| SRR945423.cb | 99.93 | 98.2 | 1.73 | 0.03 |
| SRR945427.cb | 99.94 | 98.5 | 1.45 | 0.02 |
These alignment files are now ready for use in creating counts files as you learned on the day 2 course.