Day 2

Welcome

Welcome to Spring Into Bioinformatics for 2019. Over this 3 day course we’ll hopefully cover enough concepts to get you started with your data and analyses. This course will provide the most benefit if you continue to use the skills in the weeks directly after the course, and is aimed at those with minimal to no prior bioinformatics expertise. Course material will be available at this URL indefinitely.

Most of the sessions will be self-guided, with key direction provided sporadically at important times. Please ask as many questions as you need. The tutors are specifically here to help you understand and develop your skills, so please ensure you take full advantage of their availability.

We strongly encourage you to a) read all of the notes, and b) manually type all of the code (unless directed otherwise). This will provide you with the most benefit.

Tutors

This course was primarily written by members of the Bioinformatics Hub and the tutors across the three days will be:

  • Nathan Watson-Haigh, Dan Kortschak and Mark Armstrong (Bioinformatics Hub Staff)
  • Jimmy Breen (Bioinformatics Hub / SAHMRI)
  • Nhi Hin (PhD Candidate, Bioinformatics Hub)
  • Melanie Smith (PhD Candidate, Robinson Research Institute)
  • Terry Bertozzi (SA Museum)

Working with alignments and differential expression

In keeping with the format of this course where we essentially started at the end and worked our way forward, the the last Day we are going to learn the initial steps needed to process high-throughput sequencing data. In Day 1 we learnt how to analyse and organise data in R, use important data processing packages contained in the tidyverse ecosystem, and perform plots that you would commonly create for Transcriptome sequencing projects (e.g. Volcano plot, Mean-difference plot etc). Today we are going learn how to quantify aligned data and create a gene counts table, run common functions to assess sample and project quality (unsupervised clustering and estimation of library sizes) and perform differential expression analyses.

Today’s Schedule

  • Session 1: A quick introduction to bash and setting up the VM

  • Session 2: Working with Alignments
  • Quality control of alignment data
  • Duplication/Deduplication with Picard
  • Assessment of alignment rate and multi-mapping

  • Session 3: Gene quantification and counts
  • Summarisation of aligned data to counts

  • Session 4: Differential Expression
  • Quality control of samples based on mapping counts
  • Estimation of library sizes
  • PCA/MDS assessment of outliers and batch effects
  • Choosing count distribution model: voom vs edgeR

A crash course in bash

Pardon the pun.

Introduction

Yesterday we introduced the language R, which is heavily used in bioinformatics. R is particularly effective for interacting with and visualising data, as well as performing statistical analyses. However, in modern bioinformatics R is commonly used in the middle to late stage of many analyses. Today we’re going to be interacting with programs and data using the command-line and using the language called bash.

We can utilise bashin two primary ways:

  1. Interactively through the terminal
  2. Through a script which manages various stages of an analysis.

Today we will work interactively using the terminal by executing commands and having a look at our alignment data. Before we start getting into our alignments however, we need to go through the basics of bash and how we execute commands.

Setup the directory for today

Just as we’ve created a new R Project for Day 1, let’s create a new one for today to make sure we’re all in the same place.

  • Using the File menu at the top left, select New Project
  • Select New Directory
  • Select New Project
  • If you’re not already asked to create this project as a subdirectory of ~, navigate to your home directory using the Browse button
  • In the Directory Name space, enter Day_2, then hit the Create Project button.

This again helps us keep our code organised and is good practice.

Running bash on your VM

All computers running MacOS and Linux have a terminal built-in as part of the standard setup, whilst for Windows there are several options, including git bash which also enables use of version control for your scripts. To keep everything consistent for this practical, we’ll use the terminal which is available inside RStudio. Note that even though we’re using RStudio, we won’t be interacting with R today as R runs interactively in the Console. Instead we’ll be using one of the other features provided by RStudio to access bash

To access this, open RStudio as you have for the previous practicals and make sure the Console window is visible. Inside this pane, you will see a Terminal Tab so click on this and you will be at an interactive terminal running bash.

Historically, bash is a replacement for the earlier Bourne Shell, written by Stephen Bourne, so the name is actually a hilarious joke. We’ll explore a few important commands below, and the words shell and bash will often be used interchangeably with the terminal window.

Although we haven’t specifically mentioned this up until now, your virtual machines are actually running the Ubuntu flavour of Linux, and we can access these machines by logging in remotely, as well as through the RStudio interface which you used yesterday. Most High-Performance Computing (HPC) systems you use will require a knowledge of Linux so these practicals will give you the basics skills for working in this environment. Most of the data analysis performed by the Bioinformatics Hub relies on the University of Adelaide HPC for data storage and data processing.

Finding your way around

Once you’re in the Terminal section of RStudio, you will notice some text describing your computer of the form

sib2019@2019-sib-master:~/Day_2$

The first section of this describes your username (sib2019) and the machine 2019-sib-master. The end of the machine identifier is marked with a colon (:).

After the colon, the string (~/Day_2) represents your current directory, whilst the dollar sign ($) indicates the end of this path and the beginning of where you will type commands. This is the standard interface for the Bourne-again Shell, or bash.

Where are we?

pwd

Type the command pwd in the terminal then press the Enter key and you will see the output which describes the current directory you are in.

pwd

The command pwd is what we use to __p__rint the current (i.e. __w__orking) __d__irectory. Even though we are not using R, if you have setup the R project like we instructed above this command will probably return the directory.

/home/sib2019/Day_2

Check with your neighbour to see if you get the same thing. If not, see if you can figure out why.

At the beginning of this section we mentioned that ~/Day_2 represented your current directory, but now our machine is telling us that our directory is /home/sib2019/Day_2. This raises an important and very useful point. In bash the ~ symbol is a shortcut for the home directory of the current user. If Dan was logged in, this would be /home/Dan whilst if Steve was logged in this would be /home/Steve. As we are all logged on as sib2019, this now stands for /home/sib2019. (Formally, ~ is a variable, but we’ll deal with variables later.)

Importantly every user with an account on a machine will have their own home directory of the format /home/username1, /home/username2 etc.. Notice that they will all live in the directory /home which is actually the parent directory that all users will have a home directory in, as we’ve just tried to explain. This can be confusing for many people, so hopefully we’ll clear this up in the next section or two.

In the above, the /home directory itself began with a slash, i.e. /. On a unix-based system (i.e. MacOS and Linux), the / directory is defined to be the root directory of the file system. Windows users would be more familiar with seeing C:\ as the root of the computer, and this is an important difference in the two directory structures. Note also that whilst Windows uses the backslash (\) to indicate a new directory, a Linux-based system uses the forward slash (/), or more commonly just referred to simply as “slash”, marking another but very important difference between the two.

cd

{:.no_toc}

Now we know all about where we are, the next thing we need to do is go somewhere else. The bash command for this is cd which we use to change directory. No matter where we are in a file system, we can move up a directory in the hierarchy by using the command

cd ..

The string .. is the convention for one directory above, whilst a single dot represents the current directory.

Enter the above command and notice that the location immediately to the left of the $ has now changed. Enter pwd again to check this makes sense to you.

If we now enter:

cd ..

a couple more times we should be in the root directory of the file system and we will see /$ at the end of our prompt. Try this and print the working directory again (pwd). The output should be the root directory given as /.

We can change back to our home folder by entering one of either:

cd ~

or

cd

The initial approach taken above to move through the directories used what we refer to as a relative path, where each move was made relative to the current directory. Going up one directory will clearly depend on where we are when we execute the command.

An alternative is to use an absolute path. An absolute path on Linux/Mac will always begin with the root directory symbol /.

For example, /foo would refer to a directory called foo in the root directory of the file system (NB: This directory doesn’t really exist, it’s an example). In contrast, a relative path can begin with either the current directory (indicated by ./) or a higher-level directory (indicated by ../ as mentioned above). A subdirectory foo of the current directory could thus be specified as ./foo, whilst a subdirectory of the next higher directory would be specified by ../foo.

Another common absolute path is the one mentioned right at the start of the session, specified with ~, which stands for your home directory /home/sib2019, which also starts with a /.

We can also move through multiple directories in one command by separating them with the slash /. For example, we could also get to the root directory from our home directory by typing

cd ../../

Return to your home directory using cd.

In the above steps, this has been exactly the same as clicking through directories in our familiar folder interface that we’re all familiar with. Now we know how to navigate folders using bash instead of the GUI. This is an essential skill when logged into a High Performance Computer (HPC) or a Virtual Machine (VM) as the vast majority of these run using Linux.

Important!!

Although we haven’t directly discovered it yet, most file systems used on Unix-based systems such as Ubuntu are case-sensitive, whilst Windows file systems are usually not. For example, the command PWD is completely different to pwd and doesn’t actually exist on your (or any) default installation of bash.

If PWD happened to be the name of a command which has been defined in your shell, you would get completely different results than from the intended pwd command. Most bash tools are named using all lower-case, but there are a handful of exceptions.

We can also change into a specific directory by giving the path to the cd command using text instead of dots and symbols. Making sure you’re in your home directory we can change back into the Day_2 directory

cd
cd Day_2
pwd

This is where we started the session.

Tab auto-completion

In a similar way that RStudio offered ‘suggestions’ when we start typing the name of a function, bash has the capacity for auto-completion as well. This will help you avoid a ridiculous number of typos.

If you start typing something bash will complete as far as it can, then will wait for you to complete the path, command or file name. If it can complete all the way, it will.

Let’s see this in action and start becoming keyboard heroes. Change into your home folder.

cd

Now to change back into your Day_2 folder, type cd Pr without hitting enter. Instead hit your Tab key and bash will complete as far as it can. If you have setup your directories correctly, you should see this complete to cd Day_ which is unfinished. You should have Day_1 in your home folder, so bash has gone as far as it can. Now it’s up to us to enter the final 2 before hitting Enter.

When faced with multiple choices, we can also hit the Tab key twice and bash will give us all available alernatives. Let’s see this in action by changing back to our home folder.

cd

Now type cd Da and hit the Tab key twice and you will be shown all of the alternatives. You’ll still have to type the 2 though.

Another example which will complete all the way for you might be to go up one from your home folder.

cd
cd ..

Now to get back to your home directory (/home/sib2019) start typing cd b followed by the Tab key. This should auto-complete for you and will save you making any errors. This also makes navigating your computer system very fast once you get the hang of it.

Importantly, if tab auto-completion doesn’t appear to be working, you’ve probably made a typo somewhere, or are not where you think you are. It’s a good check for mistakes.

Looking at the Contents of a Directory

There is another built-in command (ls) that we can use to list the contents of a directory. This is a way to get our familiar folder view in the terminal. Making sure you are in your home directory (cd ~), enter the ls command as it is and it will print the contents of the current directory.

ls

This is the list of files that we normally see in our traditional folder view that Windows and MacOS show us by default. We can actually check this output using RStudio too, so head to the Files tab in the Files window. Click on the Home icon (home) and look at the folders and files you can see there. Do they match the output from ls? Ask for help if not.

Alternatively, we can specify which directory we wish to view the contents of, without having to change into that directory. Notice you can’t do actually this using your classic GUI folder view. We simply type the ls command, followed by a space, then the directory we wish to view the contents of. To look at the contents of the root directory of the file system, we simply add that directory after the command ls.

ls /

Here you can see a whole raft of directories which contain the vital information for the computer’s operating system. Among them should be the /home directory which is one level above your own home directory, and where the home directories for all users are located on a Linux system.

cd 
ls Day_1

Manuals and Help Pages

Accessing Manuals

In order to help us find what options are able to be specified, every command built-in to the shell has a manual, or a help page which can take some time to get familiar with. These help pages are displayed using the pager known as less which essentially turns the terminal window into a text viewer so we can display text in the terminal window, but with no capacity for us to edit the text, almost like primitive version of Acrobat Reader.

To display the help page for ls enter the command

man ls

As beforehand, the space between the arguments is important and in the first argument we are invoking the command man which then looks for the manual associated with the command ls. To navigate through the manual page, we need to know a few shortcuts which are part of the less pager.

Although we can navigate through the less pager using up and down arrows on our keyboards, some helpful shortcuts are:

Key Action
Enter go down one line
Spacebar go down one page (i.e. a screenful)
B go backwards one page
< go to the beginning of the document
> go to the end of the document
Q quit

Look through the manual page for the ls command.

Creating a New Directory

Now we know how to move around and view the contents of a directory, we should learn how to create a new directory using bash instead of the GUI folder view you are used to. Navigate to your home folder using bash.

cd ~/Day_2

Now we are in a suitable location, let’s create a directory called test. To do this we use the mkdir command as follows:

mkdir test

You should see this appear in the GUI view, and if you now enter ls, you should also see this directory in your output.

Importantly, the mkdir command above will only make a directory directly below the one we are currently in as we have used a relative path. If automating this process via a script it is very important to understand the difference between absolute and relative paths, as discussed above.

What happens if the directory already exists?

We can prevent this error by using a parameter to the mkdir command.

man mkdir

What does the -p flag do in the mkdir command?

Text In the Terminal

We can display a line of text in stdout by using the command echo. The most simple function that people learn to write in most languages is called Hello World and we’ll do the same thing today.

echo 'Hello World'

That’s pretty amazing isn’t it? and you can make the terminal window say anything you want without meaning it.

echo 'This computer will self destruct in 10 seconds!'

There are a few subtleties about text which are worth noting. If you have man pages accessible, inspect the man echo page and note the effects of the -e option. (Unfortunately you can’t access this using echo --help.) The -e option allows you to specify tabs (\t), new lines (\n) and other special characters by using the backslash to signify these characters. This is an important concept and the use of a backslash to escape the normal meaning of a character is very common, as we saw with grep last time. Try the following three commands and see what effects these special characters have.

echo 'Hello\tWorld'
echo -e 'Hello\tWorld'
echo -e 'Hello\nWorld'

As we’ve seen above, the command echo just repeats any subsequent text. Now enter

echo ~

Why did this happen?

How To Not Panic

It’s easy for things to go wrong when working in the command-line, but if you’ve accidentally:

  • set something running which you need to exit or
  • if you can’t see the command prompt, or
  • if the terminal is not responsive

Ctrl+C is usually the first port of call when things go wrong.

Redirection Using The Pipe Symbol

Sometimes we need to build up our series of commands and send the results of one to another. The pipe symbol (|) is the way we do this and it can literally be taken as placing the output from one command into a pipe and redirecting it somewhere new. This is where thinking about the output of a command as a data stream can be very helpful. This is a very conventional approach when working in bash and was the motivation behind the creation of the magrittr package in R.

As a simple example, we could take the output from an ls command and send it to the pager less. The pager less essentially turns the terminal window into a text viewer so we can display text in the terminal window, but with no capacity for us to edit the text, almost like primitive version of Acrobat Reader.

Although we can navigate through the less pager using up and down arrows on our keyboards, some helpful shortcuts are:

Key Action
Enter go down one line
Spacebar go down one page (i.e. a screenful)
B go backwards one page
< go to the beginning of the document
> go to the end of the document
Q quit

For example, lets look at the how we can pipe one command into another

ls -lh /usr/bin | less

Page through the output (using Spacebar) until you get bored, then hit q to quit.

You can do this with a number of other handy unix tools called head, tail and wc. As you can probably work out from the first two commands, they enable you to look at the head (top) or tail (bottom) of a file. These commands are especially handy for looking at large files.

ls -lh /usr/bin | head

ls -lh /usr/bin | tail

You can also use the -n parameter to print a certain number of lines

ls -lh /usr/bin | head -n 2

The command wc stands for “Word Count” and is a really handy tool to count the number of lines, words or characters in a file. It works in a very similar way to the “Word Count” command in Microsoft Word! Lines are output first, then words and lastly characters.

ls -lh /usr/bin | wc

Pattern matching using grep

The built-in command which searches using “regular expressions” in the terminal is grep, which stands for global regular expression print. This function searches a file or input on a line-by-line basis, so patterns contained within a line can be found, but patterns split across lines are more difficult to find.

If we take the example we’ve done above, you can use grep to online extract lines that contain certain characters or words. In the /usr/bin lets search for any line in the directory that contains the word “command”

ls -lh /usr/bin | grep "command"

Thats quite specific, but you can also use smaller words or characters

ls -lh /usr/bin | grep "ls"

As you can see, a lot more lines contain the two consecutive characters “ls” than the word “command”.

Regular expressions more broadly are a powerful and flexible way of searching for text strings amongst a large document or file. Most of us are familiar with searching for a word within a file, but regular expressions allow us to search for these with more flexibility, particularly in the context of genomics. For example, we could search for a sequence that is either AGT or ACT by using the patterns A[GC]T or A(G|C)T. These two patterns will search for an A, followed by either a G or C, then followed strictly by a T. Due to time we won’t go in-depth into regular expressions, but we will delve more into those during Day 3.

Questions

{:.no_toc}

  1. How many lines were output from the ls -lh /usr/bin command?
  2. How many lines contain the word “cut”?

sed: The Stream Editor

One additional and very useful command in the terminal is sed, which is short for stream editor. Instead of the man page for sed the info sed page is larger but a little easier to digest. This is a very powerful command which can be a little overwhelming at first. If using this for your own scripts and you can’t figure something out, remember ‘Google is your friend’ and sites like are full of people wrestling with similar problems to you. These are great places to start looking for help and even advanced programmers use these tools.

For today, there are two key sed functionalities that we want to introduce.

  1. Using sed to alter the contents of a file/input;
  2. Using sed to print regions of a file

sed uses regular expressions that we have come across under the grep section, and we can use these to replace strings or characters within a text string. The command works in the form sed 'SCRIPT' INPUT, and the script section is where all the action happens. Input can be given to sed as either a file, or just as a text stream via the pipe that we have already introduced.

In the following example the script begins with an s to indicate that we are going to make a substitution. The beginning of the first pattern (i.e. the regexp we are searching for) is denoted with the slash, with the identical delimiter indicating the replacement pattern, and this is in turn completed with the same delimiter. Try this simple example from the link which is a very detailed and helpful resource about the usage sed. Here we are sending the input (echo Sunday) to the command via the pipe, so no INPUT section is required:

echo Sunday | sed 's/day/night/'

Here you are passing sed the string Sunday, and sed takes day and turns it into night.
sed will only replace the first instance of the string on any line, so try:

echo Sundayday | sed 's/day/night/'

It only replaced the first instance of day and left the second. You can make it ‘global’, where it switches every instance by using the g option at the end of the pattern like this:

echo Sundayday | sed 's/day/night/g'

Using cut

An extremely useful function when looking at delimited files, such as comma-separated or tab-separated text, is a command called cut. This command enables you to divide your files into specific columns.

Lets look at a very basic example.

echo -e 'this\tis\tthe\tfirst\tline\nthis\tis\tthe\tsecond\tline'

What happens if we want to extract the column that contains the word “first” and “second”? In other words, how do we extract the 4th column ONLY?

echo -e 'this\tis\tthe\tfirst\tline\nthis\tis\tthe\tsecond\tline' | cut -f 4

This is just a very basic example, but the cut command will become very important later when we are looking at SAM/BAM files. These files have many fields, so its often helpful to reduce the output information using cut.


Ok I told you it was going to be a crash course! If you’re having problems working with the command-line, don’t worry there will be plenty of helpers out there to give you some help. Now lets get into some biology!