Day 3

Our Dataset

For the majority of today, we’ll be working with an RNA-Seq dataset. As mentioned over the last 2 days, this is part of a larger aging study from a group researching differences between brain and muscle tissue in mice and humans during their evolutionary divergence. We have 4 skeletal muscle samples and 4 cerebellar cortex samples to compare for expression differences, which you performed on day 2. Our goal here on day three is to produce an aligned dataset suitable for gene expression analysis.

Quality Control

Using FastQC

A common tool for checking the quality of a FASTQ file is the program FastQC. As with all programs on the command line, we need to see how it works before we use it. The following command will open the help file in the less pager which we used earlier. To navigate through the file, use the <spacebar> to move forward a page, <b> to move back a page & <q> to exit the manual.

fastqc --help | less

FastQC will create an html report for each file you provide, which can then be opened from any web browser such as firefox. As seen in the help page, FastQC can be run from the command line or from a graphic user interface (GUI). Using a GUI is generally intuitive so today we will look at the command line usage, as that will give you more flexibility & options going forward. Some important options for the command can be seen in the manual. As you will see in the manual, setting the -h option as above will call the help page. Look up the following options to find what they mean.

Option Usage
--outdir
--threads

The VMs we’re all using have two cores so we can set the parameter --threads 2. We can also write to the output directory that we’ve already created above. Let’s run FastQC across all of our FASTQ files and see what we get.

# Make sure we're in the correct directory
cd ~/Day_3/

# Make the FastQC output directory ahead of time
mkdir --parents ~/Day_3/0_rawData/FastQC

# Run FastQC across all the FASTQ files
fastqc \
  --threads 2 \
  --outdir ~/Day_3/0_rawData/FastQC/ \
  ~/Day_3/0_rawData/fastq/*.fastq.gz

This will have created a FastQC report, in HTML format, for every input FASTQ file. This report file can be opened and viewed using any web browser.

# Lets see what outputs FastQC generated
ls ~/Day_3/0_rawData/FastQC/

Using RStudio, navigate to the ~/Day_3/0_rawData/FastQC/ directory. Here you’ll see all the files created by running the FastQC tool. Click on one of the HTML files and choose “View in the Web Browser”. This will open the FastQC report in a new tab in your web browser.

Inspecting a FastQC Report

The left hand menu contains a series of clickable links to navigate through the results of each QC “module” contained within the report. A simple traffic light system is used to indicate when something is deviating from what might be expected as being “normal”.

The QC metrics, and their traffic light system, are based on what is expected for whole genome shotgun sequencing - not RNA-seq. As such, some metrics which appear poor at first glance, are not unexpected for RNA-seq data sets.

Questions

{:.no_toc}

  1. How many reads are reported as being present in SRR945377.skm.fastq.gz?
  2. How long are the reads in SRR945377.skm.fastq.gz?

Interpreting a FastQC Report

As we work through the QC reports we will develop a series of criteria for cleaning up our files. There is usually no perfect solution, we just have to make the best decisions we can based on the information we have. Some sections will prove more informative than others, and some will only be helpful if we are drilling deeply into our data. Firstly we’ll just look at a selection of the plots. We’ll investigate some of the others with some “bad” data later.

Per Base Sequence Quality

{:.no_toc}

Click on the Per base sequence quality hyperlink on the left of the page. You will see a boxplot of the QC score distributions for every position in the read. These are the PHRED scores we discussed earlier, and this plot is usually the first one that bioinformaticians will look at for making informed decisions about the overall quality of the data and settings for removing poor quality regions of reads.

What do you notice about the QC scores as you progress through the read?

We will deal with trimming the reads in a later section, but start to think about what you should do to the reads to ensure the highest quality in your final alignment & analysis.

As this is RNA-Seq data, we’ll mainly be quantifying the number of reads which align to a gene. Will the actual sequence content be vital for us? Now consider a whole-genome-sequencing (WGS) experiment where we are wanting to identify SNPs or other genomic variants. Would the actual sequence content be vital for us now?

Per Tile Sequence Quality
This section just gives a quick visualisation about any physical effects on sequence quality due to the tile within the each flowcell or lane. Generally, this would only be of note if drilling deeply to remove data from tiles with notable problems. Most of the time we don’t factor in spatial effects, unless alternative approaches fail to address the issues we are dealing with.
Per Sequence Quality Scores
This is just the distribution of the average quality scores for each read, obtained by averaging all the scores at each base within a read. There’s not much of note for us to see here.
Per Base Sequence Content
This will often show artefacts from barcodes or adapters early in the reads, before stabilising to show a relatively even distribution of the bases. Here FastQC may have flagged this as a ‘fail’ and you will see considerable variability in the first few bases. This is actually very normal for RNA seq and is a consequence of non-random fragmentation and non-random adapter ligation. It’s also relatively common to see a drift towards G towards the end of a read. This can be a bit more troubling (ask a tutor to explain) but is usually remedied as we trim our reads in the next step.
Sequence Length Distribution
This shows the distributions of read lengths in our data. If the length of your reads is vital (e.g. smallRNA data), then this can also be a very informative plot. For our data, it appears that some trimming has already been performed. This was done by the sequence provider, much to our disappointment. It’s actually quite common for this to happen, it’s just the bioinformaticians love to know everything about every step that was performed. It’s also not uncommon for Illumina’s adapter removal tools to leave quite a few there and you then have to trim yet again.

Sequence Duplication Levels This plot shows about what you’d expect from a typical NGS experiment. There are a few duplicated sequences (rRNA, highly expressed genes etc.) and lots of unique sequences representing the diverse transcriptome. This is only calculated on a small sample of the library for computational efficiency and is just to give a rough guide if anything unusual stands out. Things to watch for here are peaks on the far right which would indicate massive overrepresentation of a few sequences above the rest of the source material.

Overrepresented Sequences Here we can see any sequence which are more abundant than would be expected. Sometimes you’ll see sequences here that match the adapters used, or you may see highly expressed genes here.

Adapter Content This can give a good guide as to our true fragment lengths. If we have read lengths which are longer than our original DNA/RNA fragments (i.e. inserts) then the sequencing will run into the adapters. If you have used custom adapters, you may also need to supply them to FastQC as this only searches for common adapter sequences. Here, it looks like Illumina’s automated tool has a done a pretty reasonable job.

Some More Example Reports

Let’s head to another sample plot at the FastQC homepage

Per Base Sequence Quality Looking at the first plot, we can clearly see this data is not as high quality as the one we have been exploring ourselves.

Per Tile Sequence Quality Some physical artefacts are visible & some tiles seem to be consistently lower quality. Whichever approach we take to cleaning the data will more than likely account for any of these artefacts. Sometimes it’s just helpful to know where a problem has arisen.

Overrepresented Sequences Head to this section of the report & scan down the list. Unlike our sample data, there seem to be a lot of enriched sequences of unknown origin. There is one hit to an Illumina adapter sequence, so we know at least one of the contaminants in the data. Note that some of these sequences are the same as others on the list, just shifted one or two base pairs. A possible source of this may have been non-random fragmentation.

Interpreting the various sections of the report can take time & experience. A description of each of the sections is available from the fastqc authors which can be very helpful as you’re finding your way.

Another interesting report is available at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/RNA-Seq_fastqc.html. Whilst the quality scores generally look pretty good for this one, see if you can find a point of interest in this data. This is a good example, of why just skimming the first plot may not be such a good idea.

Aggregating FastQC Reports

In our dataset, we have 8 samples so it’s not too onerous to inspect all 8 individually. In the real world, we’ll often have much larger datasets and looking at FastQC reports for all samples quickly becomes challenging. The Bioinformatics Hub has written and published an R package, called ngsReports, to enable aggregation of many FastQC reports into an easily digestable format. To use this, first create a directory called R in your ~/Day_3 directory. You will use this directory to store all of the R scripts which you will create today.

mkdir --parents ~/Day_3/R/

The R package ngsReports is available on Bioconductor and is preinstalled on your VM. Create a new R script and save it in your R folder as FastQC_section.R.

Add the following contents to this file:

library(ngsReports)
writeHtmlReport("~/Day_3/0_rawData/FastQC/", species="Mmusculus")

Now “Run” the file to have ngsReports process the FastQC report and generate an aggregated HTML report file.

Once this has completed, use your Files pane to navigate to your FastQC reports again & open the file ~/Day_3/0_rawData/FastQC/ngsReports_Fastqc.html using your Web Browser. This will contain a summary of all the files in our dataset. Take your time scrolling through the report, and note that each plot is interactive so you can hover over various points and see which file you are looking at.

Adapter and quality trimming of NGS data

Once we have inspected our data and have an idea of how accurate our reads are, as well as any other technical issues that may be within the data, we may need to trim or filter the reads to make sure we are aligning or analysing sequences that accurately represent our source material. As we’ve noticed, the quality of reads commonly drops off towards the end of the reads, and dealing with this behaviour can be an important part of most processing pipelines. Sometimes we will require reads of identical lengths for our downstream analysis, whilst other times we can use reads of varying lengths. The data cleaning steps we choose for our own analysis will inevitably be influenced by our downstream requirements.

The Basic Workflow

Data cleaning and pre-processing can involve many steps, and today we will use the basic work-flow as outlined below. Each analysis is slightly different so some steps may or may not be required for your own data, however many workflows do have a little overlap, and some pipelines (e.g. Stacks) may even perform some of these steps for you.

A basic workflow is:

  1. Remove Adapters and Quality Trim (AdapterRemoval)
  2. Run FastQC on trimmed reads.
  3. Splice Alignment to a reference (STAR)
  4. Post-alignment QC (picard markDuplicates, IGV)

Removal of Low Quality Bases and Adapters

Adapter removal is an important step in many sequencing projects, especially projects associated with DNA/RNA inserts which are shorter than the sequencing read lengths. A good example of this would be an experiment where the target molecule is small non-coding RNAs. As these are generally between 19-35bp, which is shorter than the shortest read length provided by Illumina sequencing machines, all reads containing a target molecule will also contain adapters. In addition, when we size select our initial fragments, we select a range of fragment sizes and some are bound to be shorter than our read length. Therefore it is important to trim adapters accurately to ensure that the genome mapping and other downstream analyses are accurate.

In the early years of NGS data, we would run multiple steps to remove both adapters, low quality bases (near the ends of reads) and reads which have overall lower quality scores. Today’s trimming algorithms have become better at removing low-quality bases at the same time as removing adapters. The tool we’ll use for this step today is AdapterRemoval.

We can trim raw data from Illumina machines using the Illumina paired-end adapters obtained from this website These are commonly used in Illumina projects.

Before we perform adapter trimming, look at the following code.

cd ~/Day_3/

# Create output directories ahead of time
mkdir -p ~/Day_3/1_trimmedData/fastq
mkdir -p ~/Day_3/1_trimmedData/log

# Run adapter removal
AdapterRemoval \
    --file1 ~/Day_3/0_rawData/fastq/SRR945375.skm.fastq.gz \
    --output1 ~/Day_3/1_trimmedData/fastq/SRR945375.skm.fastq.gz \
    --discarded ~/Day_3/1_trimmedData/fastq/SRR945375.skm.discarded.gz \
    --minlength 50 \
    --threads 2 \
    --trimns \
    --trimqualities \
    --minquality 20 \
    --gzip \
    --settings ~/Day_3/1_trimmedData/log/SRR945375.skm.fastq.gz.settings

Questions

{:.no_toc} 1. What do the options --minlength 50 and --minquality 20 specify in the above? Do you think these choices were reasonable? 2. Notice the adapter sequence wasn’t specified anywhere. Did we miss an important setting? 3. What do you expect to find in the file specified using the --discarded option? 4. What do you expect to find in the file specified using the --settings option?

Run the above code by pasting into your terminal. Did you answer the questions correctly?

The AdapterRemoval tool can be made to output information about the trimming process to a file. In the above we wrote this output to a “settings” file using the --settings option to output this to the SRR945375.skm.fastq.gz.settings file. Let’s have a look in the file to check the output.

less ~/Day_3/1_trimmedData/log/SRR945375.skm.fastq.gz.settings

As this file contains a sample of good reads, it’s not surprising that we didn’t lose many reads during this step. Notice that many reads were trimmed (Number of well aligned reads), but were still long enough and high enough quality to be retained.

Trimming using a script

In the above, we manually ran the process on an individual file and manually specified key information about where to write various output files. In the real world, this is not really a viable option and we’ll need to write a script to trim ALL of our sample FASTQ files.

Before we write this script, let’s think about what will be involved.

1 - We’ll need to provide a list of input FASTQ files to work on 2 - We’ll need to specify different output files 3 - Some parameters will be constant, whilst others will change for each file

With the above provided AdapterRemoval command as a guide, we will guide you to creating a script capable of processing all your FASTQ read files.

How do we find our files?

First, create a directory called bash to store our bash scripts:

mkdir -p ~/Day_3/bash/

Create a new file called removeAdapters.sh and save it as ~/Day_3/bash/removeAdapters.sh with the following content:

#!/bin/bash

INDIR=~/Day_3/0_rawData/fastq

# Loop over out FASTQ files
for f in ${INDIR}/*.fastq.gz
do
  # Echo what file(s) we found
  echo "Found ${f}"
done

Run the script by typing bash ~/Day_3/bash/removeAdapters.sh into the terminal.

Did you see the files you expected returned?

Script explanation

The shebang #! at the top of the script file tells the operating system which scripting language we want to use, in this case bash /bin/bash. All other lines beginning with # are comment lines and are used to self-document a script. Comments are important as they both help us to remember what we were trying to achieve and help others reading our code to understand what we were intending to do.

Next we set a variable to our input directory using INDIR=~/Day_3/0_rawData/fastq. Doing so will help our future selves (or others) to more easily update the script to process data from a different input directory by simply updating this one line.

The line for f in ${INDIR}/*.fastq.gz initialises a “for loop” to iterative over each file matched by ${INDIR}/*.fastq.gz and execute the code specified between do and done for each of the files (${f}).

We are simply using an echo command to illustrate what is happening with a for loop before we start implementing commands for processing the FASTQ files for real.

Try changing line 9 (the echo line) to the following and rerun the script.

  echo "Found $(basename ${f})"

What was the difference?

The program basename, takes a filepath/directory and remove all but the last element of the path. In this case, it strips off all the leading directories from the filepath of each FASTQ file.

Now that we’ve figured how to loop over our input FASTQ files let’s try to generate a variable which contains the path of one our intended AdapterRemoval output files.

Change your script to be the following.

#!/bin/bash

INDIR=~/Day_3/0_rawData/fastq

# Define our output parent directory
OUTDIR=~/Day_3/1_trimmedData

# Check our output directory exists before going any further
if [ ! -d "${OUTDIR}" ]; then
    # If the directory doesn't exist, exit with a message and an error
    echo "${OUTDIR} does not exist"
    exit 1
fi


# Loop over out FASTQ files
for f in ${INDIR}/*.fastq.gz
do
  # Echo what file(s) we found
  echo "Found $(basename ${f})"
  
  # Define the filepath for out trimmed reads
  OUTFILE=${OUTDIR}/fastq/$(basename ${f})
  echo "Trimmed reads will be written to ${OUTFILE}"
done

Script explanation

We have introduced a couple of new concepts in this script. The if ... fi block only executes if the condition within the square brackets [] resolves to TRUE. To check if a directory exists we use the construction if [ -d foldername ]; where -d tells the script we are checking for the existence of a directory called foldername. As we are interested in reporting if the directory doesn’t exist, we use the operator ! (NOT) so the construction evaluates to TRUE if the directory is not found if [ ! -d "${OUTDIR}" ].

Take careful note of the spaces in the square brackets as these are necessary and will easily trip up new users if they are missed.

The exit 1 command will exit out of the script without any further actions, which saves time if the input files are missing and prevents resources being wasted on later processes. The informative error message output by echo on the line above is helpful when you want to know why your script failed.

In the do ... done block we can see how the OUTFILE, or the filename to be created as output, is constructed. The variable for the output directory OUTDIR is appended with the fastq directory name, /fastq/, and the filename, provided by $(basename ${f}).

Now you’ve seen how to construct the filepath for an output file, we are close to being able to place the AdapterRemoval command within the script.

What else do you think we need to specify?

Add the following lines to your script on a new blank line immediately before our final done command. If you’re having trouble with any of this please call a tutor over too!

  DISCARDED=${OUTDIR}/fastq/$(basename ${f/%.fastq.gz/.discarded.gz})
  echo "Discarded reads will be written to ${DISCARDED}"

Note that in the above we used the bash ${string/%substring/replacement} substring replacement syntax to replace the .fastq.gz suffix with .discarded.gz.

We still need to specify the settings file.

  SETTINGS=${OUTDIR}/log/$(basename ${f/%.fastq.gz/.settings/})
  echo "AdapterRemoval settings file will be written to ${SETTINGS}"

Now we have all the information we need to run AdapterRemoval on every input file!

We’ll leave it up to you to complete the script by adding in the AdapterRemoval command inside the for loop. Don’t forget to use the variables (${f}, ${OUTFILE}, ${DISCARDED} and ${SETTINGS}) for the input and output files! If you have trouble, don’t forget to shout out for help.

Can you think of anything important we skipped over in the above?
We didn’t check for the existence of the output directories for the trimmed fastq files, or for the settings files

The complete script should complete in about 10 minutes.