Welcome to Spring Into Bioinformatics for 2019. Over this 3-day course we’ll hopefully cover enough concepts to get you started with your data and analyses. This course is aimed at those with minimal to no prior bioinformatics expertise, and will provide the most benefit if you continue to use the skills in the weeks directly after the course. Course material will be available at this URL indefinitely.
Most of the sessions will be self-guided, with key direction provided sporadically at important times. Please ask as many questions as you need. The tutors are specifically here to help you understand and develop your skills, so please ensure you take full advantage of their availability.
We strongly encourage you to a) read all of the notes, and b) manually type all of the code (unless directed otherwise). This will provide you with the most benefit.
This course was primarily written by members of the Bioinformatics Hub and the tutors across the three days will be:
As you will no doubt be aware, R is one of the most commonly used languages/environments in modern biological research. Whilst originally developed as a statistical teaching tool in the 1990s, a large and diverse ecosystem of packages has been developed, enabling analysis of everything from financial markets to election polls to biological research. There is truly too much to cover in a 3-day course, but hopefully we’ll get you over the initial hurdle of unfamiliarity.
There is no such thing as a perfect programming language, and R has many features (or ‘quirks’) that are rooted in its historical origins. However, it can be a very useful tool. An important concept to remember from these sessions is that we write code for two primary reasons.
A key advantage of R over software like Excel is that everything you do is recorded as a script. Every change you make to your data can be revisited months or years later, and every analysis or figure can be recreated at a later stage, or easily modified as more data points are collected. If you accidentally ‘break’ or modify your data in Excel, there will be no record of this event, with cut and paste errors being essentially invisible. In R, these mistakes are (generally) easy to find and correct, ensuring reproducible and robust analysis, as well as enabling collaboration between researchers.
R is an open source language, meaning there is no software giant like Microsoft or Apple forcing you to buy their software, and hiding all the code so you don’t really know what it’s doing. Instead, all of the code that runs R is publicly visible, and maintained by what is essentially a volunteer community, consisting mainly of academics spread throughout the world.
Some examples of trivial R code might be as follows. Don’t worry about typing these just yet. Instead, have a look and try to understand everything you see at this point.
x <- 1:5
print(x)
## [1] 1 2 3 4 5
sqrt(x)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068
x + 1
## [1] 2 3 4 5 6
We’ll come back and actually run these examples later, but for now you may be able to see that in the first line, x <- 1:5, we’ve created an object called x which has the values 1:5. We used the <- symbol, which was specifically designed to look like an arrow, to put those values into x.
After that we performed a few operations on x, such as showing what values are in x (print(x)), finding the square root of all of the values in x (sqrt(x)), or adding one to all of the values (x + 1). Again, we’ll come back to this very soon.
R will always print the position of the value which starts each line too, which is why you see [1] at the start of each of these examples. This can be very handy when printing tens or hundreds of values. We’ve also added the ## symbol at the start of each output line to clearly denote this is R output. This is non-essential, but a common convention.
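The position labels become more useful once a vector is too long to fit on one line. As a small sketch (the object name y is just chosen for illustration), try printing thirty values:

```r
# A longer vector wraps over several lines when printed
y <- 101:130
print(y)
```

Each wrapped line of output begins with the position of its first value in square brackets. Exactly where the lines wrap depends on the width of your console, so your output may be laid out differently to someone else’s.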
Whilst R itself is the language we use, we mostly interact with R using an Integrated Development Environment (IDE) called R Studio. We can almost think of R Studio as being like the cabin of a car, with R itself as the engine. Though we can tell the engine (R) exactly what to do from the cabin (R Studio), R Studio also has many features that don’t directly interact with R, but that make our lives safer and more convenient, just like a cabin will usually have a radio, air-conditioning and seat belts. We can use R Studio as a file browser, text editor and bash interface, as well as for running version control software. For today, we’ll stick to the text editor (where we write our code) and file browser, although we will use bash more and more as the course progresses.
Interestingly, R Studio is produced by a company (called R Studio) and as such doesn’t have some of the same open source features. However, it is still free software. You will have already been through the login process, and this is the point where you will first see what R Studio looks like to a user.
The main RStudio interface
The main window you can see on the left has a few tabs available, so ensure the tab with the word Console is active. This is where we can interact directly with R itself. At the top you’ll see the R version listed along with a few other pieces of information, whilst underneath this you’ll see the > symbol. This is the R prompt and is where we can type our code to be executed.
On the top right you’ll see another pane with three tabs. We won’t explore all of these, but the Environment and History tabs can be very useful.
On the bottom right you’ll see another pane with tabs for Files, Plots, Help and a few other things. We’ll definitely use those three, but the other two won’t get much of a run.
Just to get a feel for how R works, let’s try typing a few things directly into the R Console. At the > prompt, type the following:
x <- 1:5
As we discussed earlier, we’ve now created an object called x which contains the numbers 1 to 5. This is an R object, known as a vector. We won’t really spend much time talking about vectors in this course, but here it’s a sequence of numbers, which you can think of as being like a single column in a spreadsheet.
We can view the contents of this R object either by typing its name, or by using the function print(). Mostly we don’t need the print() function but it does come in handy sometimes.
x
print(x)
Just like you might call the Excel function AVERAGE() and select a range of values to find the mean, we can do this in R using the equivalent function mean().
mean(x)
We can use all sorts of mathematical functions like sum(), max(), min(), sqrt(), median(), var() or sd() on our vector. The important point to notice is that when we call the function, we pass the R object to the function by including the object between the brackets (), just like we did with our calls to print() and mean(). This is very similar to how the Excel functions work, but instead of passing a range of cells, we pass an R object to the function. Try a few of the above commands to see how you go.
We can also perform mathematical operations directly on our R object, which, as you can see, is much easier than highlighting a range of cells in Excel.
x + 1
x*2
x^2
When working with Excel and calling a function on a range of cells, we would usually write the output to another cell in the spreadsheet. However, in the above we just wrote the output of each function directly to the console, without modifying x or saving the output anywhere.
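One way to convince yourself that x is left untouched is to print it again after an operation. This is only a quick check, again assuming x holds the values 1 to 5:

```r
x <- 1:5
x * 2              # the result is printed, but not stored anywhere
## [1]  2  4  6  8 10
x                  # x itself still holds the original values
## [1] 1 2 3 4 5
identical(x, 1:5)  # TRUE if x is exactly the same as 1:5
## [1] TRUE
```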
Have a look at your Environment tab in the top right pane, and note that you can see the object x there. The R Environment is where we created the object x, and this is like our workspace where we can store any R objects. As an example, let’s save our output from that last mathematical operation where we calculated \(x^2\).
x_sq <- x^2
Notice that the results of x^2 were no longer printed to the console, and we now have two objects in our R Environment (both x and x_sq). How would we view the contents of x_sq?
In the above, we initially created our object x by giving it the values 1 to 5. R interprets the : symbol as a quick way to create a sequence from the first value up to the second value, in steps of 1. A far more common method for creating vectors is to enclose all the values we need inside the function c(). This function name stands for combine, and an example may be:
fib <- c(0, 1, 1, 2, 3, 5)
This example is just the start of the Fibonacci sequence, but we can use any values we want. Notice that these were all numbers. We can also specify character (i.e. text) vectors by quoting the supplied values.
pets <- c("dog", "cat", "bird")
R also has a convention that text strings must be quoted. If you see text which is not quoted, this will be referring to an R object, or sometimes the column name of a spreadsheet-like object we’re working with.
As an important technical note, R can only create vectors where all values are the same type. Character vectors will always display with quotes around the text when you print them in the R console. If you mix numbers amongst text values, the numbers will be coerced to text strings and will also appear with quotes, indicating they have been stored as the alpha-numeric symbols for those numbers, but with no implicit value. This is quite different to Excel, where each value is treated in isolation, but once you get used to it, it’s actually very useful.
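A small sketch of this coercion is below. The object name mixed is just chosen for illustration, and class() is the function which reports the type of values stored in an object.

```r
# Mixing text and numbers: the numbers are coerced to text
mixed <- c("a", 1, 2)
mixed
## [1] "a" "1" "2"
class(mixed)
## [1] "character"

# A vector containing only numbers stays numeric
class(c(0, 1, 1, 2))
## [1] "numeric"
```

Because the values in mixed are now text, trying something like mixed + 1 will give an error rather than a number.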
Finally for this section, if you’re not sure how a function works, the easiest way to call up the help page is to type the name of the function into the R Console, but with a ? beforehand. As an example, try typing ?sd into the console. This will call up the help page for sd() (which calculates the standard deviation of a vector).
This function takes two arguments: the vector of values to work on (referred to as x by the function), and na.rm, which controls whether missing values are removed before the calculation (by default, na.rm = FALSE).
Some help pages can be a little harder to understand than this one, but the more you use R, the clearer they become. We’ll check a few more throughout the day and hopefully you’ll be able to make sense of them by the end of the course.
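To see what the na.rm argument described on the help page does, here is a brief sketch using a made-up vector (vals is just an illustrative name) which contains one missing value, written in R as NA:

```r
vals <- c(2, 4, 6, NA)

# With the default na.rm = FALSE, the missing value carries through
sd(vals)
## [1] NA

# Setting na.rm = TRUE removes the missing value first
sd(vals, na.rm = TRUE)
## [1] 2
```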