Welcome to Spring Into Bioinformatics for 2019. Over this 3-day course we’ll hopefully cover enough concepts to get you started with your own data and analyses. This course is aimed at those with minimal to no prior bioinformatics expertise, and will provide the most benefit if you continue to use the skills in the weeks directly after the course. Course material will be available at this URL indefinitely.
Most of the sessions will be self-guided, with key direction provided sporadically at important times. Please ask as many questions as you need. The tutors are specifically here to help you understand and develop your skills, so please ensure you take full advantage of their availability.
We strongly encourage you to a) read all of the notes, and b) manually type all of the code (unless directed otherwise). This will provide you with the most benefit.
This course was primarily written by members of the Bioinformatics Hub and the tutors across the three days will be:
Continuing with the format of this course, this last day is where we learn the initial steps needed to process high-throughput sequencing data. On Day 1 we learnt how to analyse and organise data in R, use important data-processing packages from the tidyverse ecosystem, and create the plots commonly used in transcriptome sequencing projects (e.g. volcano plots, mean-difference plots etc.). On Day 2 we learnt how to quantify aligned data to create a gene counts table, run common functions to assess sample and project quality (unsupervised clustering and estimation of library sizes), and perform differential expression analyses.
Next-generation sequencing (NGS) has become an important tool in assessing biological signal within an organism or population. Stemming from previous technologies that were costly and time-consuming to run, NGS platforms are relatively cheap and enable the investigation of the genome, transcriptome, methylome etc. at extremely high resolution. The high throughput of these machines also presents unique challenges, and it is important that scientists are aware of the potential limitations of the platforms and the issues involved in the production of good quality data.
On this third day, we will introduce the key file types used in genomic analyses, illustrate the fundamentals of Illumina NGS technology (the current market leader in the production of sequencing data), describe the effect that library preparation can have on downstream data, and run through an NGS pipeline from raw sequencing data through to aligning the data against a reference genome and preparing it for the analyses performed on Day 2.
Just as we’ve created a new R Project for Days 1 and 2, let’s create a new one for today to make sure we’re all in the same place. We will not be writing R code in this session, but we will be using the terminal from within RStudio to run bash commands.
From the File menu at the top left, select New Project, then New Directory, then New Project. Navigate to your home directory (~) using the Browse button, enter Day_3 in the Directory Name space, then hit the Create Project button. This again helps us keep our code organised and is good practice. To run bash commands, open RStudio as you have for the previous practicals and make sure the Console window is visible. Inside this pane you will see a Terminal tab; click on this and you will be at an interactive terminal running bash.
Before we can begin to analyse any data, it is helpful to understand how it was generated. Whilst there are numerous platforms for generation of NGS data, today we will look at the Illumina Sequencing by Synthesis method, which is one of the most common methods in use today. Many of you will be familiar with the process involved, but it may be worth looking at the following 5-minute video from Illumina. As setting up the sound with the VMs can be tricky, it will be easier to view this from your own regular browser. Briefly minimise the VM, open your regular browser and please use your headphones if you brought them.
This video refers to the process of tagmentation. This is a relatively recent method for fragmenting DNA and attaching adapters, with more traditional methods involving sonication, poly-adenylation and attachment of appropriate adaptors in separate steps. This step may vary depending on your experiment, but the important concept to note during sample preparation is that the DNA insert has multiple sequences ligated to either end. These include 1) the sequencing primers, 2) index and/or barcode sequences, and 3) the flow-cell binding oligos. To demonstrate these concepts further, observe the following figure, which shows the DNA construct needed for an Illumina sequencing run and the amplification steps required:
*(Figure: library preparation)*
It is important to be aware that, as there are multiple library preparation steps, each of these offers an opportunity for confounding factors and biases to be introduced. Variation and bias can be introduced during library construction through random hexamer mispriming, contamination by off-target transcripts, fragmentation size or the amplification of PCR artifacts. Across different runs on different machines you can get further sources of variability due to sequencing protocols, the device used (e.g. IonTorrent, Illumina, PacBio) and the sequencing depth. All of these sources of variability can lead to so-called batch effects, which may need to be compensated for in the experimental design if the samples being compared were not processed in the same experimental run.
This is a good opportunity to explore a few important file types and utilities within the Bioinformatics field.
Most of us have seen these, and the basic format is very simple. A sequence is made up of two components: a header line (starting with >) and one or more lines comprising the sequence information. It is typical for the sequence to be wrapped at 50 or 80 characters. A FASTA file can contain one or more of these sequences, and can be used for DNA, RNA or amino acid sequence data. Here’s an example FASTA-formatted sequence:
>NC_000067.6:5588450-5606134 Mus musculus strain C57BL/6J chromosome 1, GRCm38.p6 C57BL/6J fragment
GCGGGGGCAACCAGCCCTGCAGCTCTCAGCTTGCCTGTCCCCGGCGCACCTTGCTGATCCCAAACAGGCA
GAGCTTCTTCCAGTCTTGGAAGGCACAAATTGAGCATCAGGAACGTGGACCCATCAGGGCTGAACAGCTA
CTCAGGATCTAAAGTGGTGACTTGGAAAGCTGACGGTGACTTGGGAAGGGAGGTCGCCAATCAGCGATCT
GGAGGTGAGCTTGGGGCTTTCGAGGCACAGTGGGGAAATTGCTGTGGCTAGTGCTTCTTGGGGTTGCATG
CACAGACAGGCAGGGTAGAGGCAGCCACAGGAGTGGA
This is the format genomes are provided in by all genomic repositories such as Ensembl, NCBI and the UCSC. Each chromosome is specified by the header, with the entire sequence following.
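Because every header line begins with the > character, standard command-line tools can be used to summarise a FASTA file. Here is a minimal sketch, assuming a (hypothetical) genome file named genome.fa in the current directory:

# Count the number of sequences in the file by counting header lines
grep -c "^>" genome.fa

# Print just the header lines to see which chromosomes/sequences are present
grep "^>" genome.fa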
FASTQ is an extension of FASTA. Not only does it contain sequence information, it also contains quality information for each base. This is the format in which most sequencing data is provided, and it commonly uses .fq or .fastq as the file extension.
These files are often very large and easily compressed. As such, they will often be compressed using gzip and thus have .fastq.gz as their file extension. This compression provides two benefits: 1) the files occupy less space on disk, and 2) they can be read from disk faster. Whilst you might be tempted to decompress these files in order to use them, most tools capable of reading FASTQ files can read the gzip-compressed files directly.
Sequence Alignment/Map (SAM) files are plain text files containing read alignments. The binary version of a SAM file is known as a BAM file and is a more computer-friendly format. SAMtools’ samtools view can be used to convert between SAM <-> BAM. BAM files are typically 5-10 fold smaller than their corresponding SAM format file.
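As a brief sketch of this conversion (assuming a recent version of samtools, which auto-detects the input format; the file names aln.sam and aln.bam are hypothetical):

# Convert SAM to BAM (-b requests BAM output)
samtools view -b aln.sam > aln.bam

# Convert BAM back to SAM, including the header (-h)
samtools view -h aln.bam > aln.sam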
A BED format file can be used to “annotate” a region of a genome by storing genome coordinates as a tab-delimited text file of at least 3 columns providing the chromosome (chrom), start (chromStart) and end (chromEnd) positions. In this way we can simply define genomic regions of interest that we have found in our analysis, and can visualise them.
A full description of the format is available at: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Tools have been developed which allow BED files to be used to extract information from other file types that intersects in some way with the coordinates defined in the BED file: for instance, obtaining a subset of alignments from a larger file, or extracting variant calls or gene models/features.
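As an illustration, a minimal 3-column BED file and one such tool in action might look like the following; the file names, coordinates and the use of bedtools here are assumptions for the sake of the example:

# regions.bed: chrom, chromStart, chromEnd (tab-delimited)
chr1	5588450	5606134
chr2	1000	2000

# Keep only the alignments from a BAM file that overlap these regions
bedtools intersect -abam aln.bam -b regions.bed > subset.bam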
General Feature Format (GFF) and General Transfer Format (GTF) are two very similar file formats for storing chromosome “features” such as gene annotations (e.g. exons, transcripts, genes). GFF comes in two versions, 2 and 3, and GTF is sometimes referred to as GFF v2.5. GFF and GTF are both tab-delimited text files with quite restrictive columns by design:
"":
Notice that there’s no real way to represent our FOXP3 sites as a GTF file! This format is really designed for gene-centric features, as seen in the 3rd (feature) column. An example is given below. Also note that header rows are not controlled, but must start with the comment character #
# Data taken from http://mblab.wustl.edu/GTF22.html
381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";
381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";
381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";
Note: People variously use GFF and GTF to refer to GFF version 2, and GFF to refer to GFF version 3. GFF2 is not compatible with GFF3, so make sure you have the correct file format if you are given a GFF file. There are conversion tools available to inter-convert them, but they are rarely reliable.
Genome browsers are applications that provide a way to view, explore and compare genomic information in a graphical environment.
Genome browsers enable researchers to visualize and browse entire genomes with annotated data including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, and so on. Annotated data is usually from multiple diverse sources. They differ from ordinary biological databases in that they display data in a graphical format, with genome coordinates on one axis and the location of annotations indicated by a space-filling graphic to show the occurrence of genes and other features¹.
{:.no_toc}
Often we’ll use a genome browser running on our local machines (IGV), but a good one to start with is the web-based UCSC browser. Click this link and you should see a slightly intimidating screen full of information.
You’ll be able to see:
Once you’ve got a handle on what’s there, locate the hide all button and click that, which will just give the genomic region with no track information. We can turn on a huge variety of “tracks” which contain genomic information that we may care about. Let’s start by turning on the GENCODE transcripts again.
Under the Genes and Gene Predictions section, find the GENCODE VM20 drop-down menu and click the arrow next to the word ‘hide’. Change this to ‘full’ and hit one of the refresh buttons you can see scattered across the page. Now the transcripts will appear again in a less cluttered display. Under the hood, the browser has used this information saved as a BED file, which enables us to define genomic regions in a convenient tab-delimited format. We’ll look at these in more detail soon.
As well as showing the transcript structure, we can also show simple genomic features like a SNP, so let’s look for the Variation region down the page a little, then set the Common SNPs(142) track to full as well. To make these changes appear, hit a refresh button again and the browser will now have this track showing. This is pretty crazy, so we can condense this using the pack option. Try this then the dense and squish options to see what difference they all make.
If you haven’t already tried it, you can click on any of the genomic features and you’ll be taken to a page containing all the key information about that feature. You can also drag your mouse over regions to zoom in, and can zoom out using the buttons at the top of the page. Type the name of your favourite gene into the search box and you’ll be able to find your way to that. If you can’t think of one, just enter Oprk1 and you’ll be taken to a page full of choices. Select the top transcript from the page, under Gencode Genes. This will take you back to the browser, but just showing the region for the selected transcript.
Now we’ve had a brief exploration of the browser, let’s look at some file types which will enable us to upload custom features, and which are useful at numerous stages during analysis using NGS data.
There are two videos that go through the kinds of things you can do with the UCSC genome browser here and here. Take some time later to have a look at these.
In the terminal, make sure you are in the Day_3 directory within your home directory.
cd ~/Day_3/
You must ensure that you use this exact path as any variations will undoubtedly cause you problems and lead to unnecessary confusion later on. You can use the command pwd to ensure the path you are in is correct.
Use the mkdir command to create a new directory called 0_rawData within your Day_3 directory, which will be the location for raw FASTQ files we have obtained from a sequencing system.
mkdir ~/Day_3/0_rawData
We now want to use the cp command to copy the raw FASTQ files to your new 0_rawData folder.
cp --recursive \
~/data/Day_3/0_rawData/* \
~/Day_3/0_rawData/
# Have a look to see what files you have available
ls ~/Day_3/0_rawData/fastq
The command zcat can be used to expand a *.gz file and print its contents to the terminal via standard output (STDOUT). If we did this with a large FASTQ file we would see a stream of data whizzing past in the terminal - not very helpful. Instead, we can pipe the output of zcat to the command head in order to view only the first 10 lines of the file.
cd ~/Day_3/0_rawData/fastq
zcat SRR945375.skm.fastq.gz | head
Using the argument --lines we can modify the behaviour of head to return a specified number of lines, for example --lines 8 will return the first 8 lines of a file.
zcat SRR945375.skm.fastq.gz | head --lines 8
You may also remember the pipe symbol | in the above command from Day 2. In this example we have taken the output of the zcat command (zcat SRR945375.skm.fastq.gz) and redirected it to another command (head). There is no limit to the number of commands we can string together using pipes, allowing us to build pipelines that take data from a source through to a destination via a series of transformations.
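For example, we could chain a third command onto the previous one to view only the second read in the file (lines 5 to 8):

# Take the first 8 lines, then keep just the last 4 of those
zcat SRR945375.skm.fastq.gz | head --lines 8 | tail --lines 4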
In the output from the above terminal command, we obtained the first 8 lines of the gzipped FASTQ file. This gives a clear view of the FASTQ file format, where each individual read spans four lines. These lines (illustrated with a sketch after the list) are:
1. The read identifier
2. The sequence read
3. An alternate line for the identifier (commonly left as just a + symbol, acting as a placeholder)
4. The quality scores for each position along the read, as a series of ASCII text characters.
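Putting these four lines together, a single FASTQ record might look like the following. The identifier is the real first identifier from our file, but the sequence and quality string shown here are purely illustrative:

@SRR945375.39 HWI-ST1145:52:D0PV8ACXX:3:1101:14387:1975:213/1
GATTACAGATTACAGATTACA
+
IIIIHHHGGGFFFEEEDDDCC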
Let’s have a brief look at each of these lines and what they mean.
This line begins with an @ symbol and, although there is some variability, it traditionally has several components. The data we have been working with over the last 3 days was sourced from an EBI data repository, and this particular file has the identifier SRR945375. For the first sequence in this file we have the full identifier @SRR945375.39 HWI-ST1145:52:D0PV8ACXX:3:1101:14387:1975:213/1, which has the following components:
| Component | Description |
|---|---|
| @SRR945375.39 | The aforementioned EBI identifier and the sequence ID within the file. If this was the first read, we would have the number 1; however, as this is a chromosome 1 subset from a genome dataset, this particular file begins at read number 39. NB: this identifier is not present when data is obtained directly from the machine or service provider. |
| HWI-ST1145:52:D0PV8ACXX | The unique machine ID |
| 3 | The flowcell lane |
| 1101 | The tile within the flowcell lane |
| 14387 | |
| 1975 | The x-coordinate of the cluster within the tile |
| 213 | The y-coordinate of the cluster within the tile |
| /1 | Indicates that this is the first read in a set of paired-end reads |
As seen in the subsequent sections, these pieces of information can be helpful in identifying whether any spatial effects have affected the quality of the reads. By and large you won’t need to utilise most of this information, but it can be handy for times of serious data exploration.
The only other line in the FASTQ format that really needs some introduction is the quality score information. These scores are presented as single ASCII text characters for simple visual alignment with the sequence, and each character corresponds to a numeric value, which is the quality score. In the ASCII text system, each character has a numeric value which we can interpret as an integer. Head to the ASCII Code table website for a description of these.
The first 32 ASCII characters (values 0-31) are non-printable and contain things like end-of-line marks and tab spacings, and note that the first printable character after the space (value 32) is “!”, which corresponds to the value 33. In short, the values 33-47 are symbols like !, # and $, whereas the values 48-57 are the digits 0-9. Next are some more symbols (including @ for the value 64), with the upper-case letters representing the values 65-90 and the lower-case letters representing the values 97-122.
Now that we understand how to turn an ASCII character into a numeric value, we need to know what these numbers represent as quality scores. The two main systems in common usage are PHRED+33 and PHRED+64; for each of these coding systems we subtract either 33 or 64 from the numeric value associated with each ASCII character to give us a PHRED score. As will be discussed later, this score ranges between 0 and about 41.
The PHRED system used is determined by the software installed on the sequencing machine, with early machines using PHRED+64 (CASAVA < 1.8) and more recent machines using PHRED+33. For example, in PHRED+33 the @ symbol corresponds to Q = 64 - 33 = 31, whereas in PHRED+64 it corresponds to Q = 64 - 64 = 0.
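You can verify this at the terminal yourself. This is a minimal sketch using the shell’s printf, where a leading quote makes printf print a character’s numeric ASCII value:

# The ASCII value of the @ character (prints 64)
printf '%d\n' "'@"

# Subtract the PHRED+33 offset to recover the quality score (prints 31)
echo $(( $(printf '%d' "'@") - 33 ))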
The following table demonstrates the comparative coding scale for the different formats:
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&’()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]ˆ_‘abcdefghijklmnopqrstuvwxyz{|}~
| | | | | |
33 59 64 73 104 126
S - Sanger Phred+33, raw reads typically (0, 40)
X - Solexa Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64, raw reads typically (3, 40)
L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
The quality scores are related to the probability of calling an incorrect base through the formula
$$Q = -10 \log_{10} P$$
where P is the probability of calling the incorrect base. This is more easily seen in the following table:
| PHRED Score | Probability of Incorrect Base Call | Accuracy of Base Call |
|---|---|---|
| 0 | 1 in 1 | 0% |
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10000 | 99.99% |
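Rearranging the formula gives P = 10^(-Q/10), so any row of this table can be checked at the terminal, for example with awk:

# The probability of an incorrect call for a PHRED score of 30 (prints 0.001)
awk 'BEGIN { Q = 30; print 10^(-Q/10) }'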
Now that we understand how quality scores are encoded in the FASTQ file, we have a basis for understanding how the rest of the quality control process works. The FastQC application, which compiles and reports on FASTQ file data quality, is covered in the next session.
{:.no_toc}