Introduction

In our previous sections, we learned how to document our workflow using R Markdown, with plain-text paragraphs and R code embedded in the same document. Once we’d learned that, we spent some time learning how to manipulate text strings. For today’s session, we’ll introduce a more common object type, known as a data frame (or tibble), which resembles a spreadsheet. First, we’ll import a file into R, and then we’ll learn how to manipulate the data object inside the R Environment.

Setting up our workspace

Before we go any further, we need to keep developing our good programming practice. If you don’t already have an R Project for practical_2, create a new project with this name as a subdirectory of your ~/transcriptomics directory. Call a tutor over if you can’t remember how to do this.

Once you have an R Project ready to go for practical_2, create a new directory called data within this directory. We already have your data for today on your VM, located in ~/data/transcriptomics. The two files we’ll work on today are called Enrichment.csv and de.tsv. Copy the file Enrichment.csv to practical_2/data so today’s data will be nicely contained within an R Project.

Once you have Enrichment.csv in your data directory, begin a new R Markdown document and save this as DataImport.Rmd in your project root directory. As per usual, delete everything after the setup chunk.

Introducing The Tidyverse

Previously we saw the package stringr which was dedicated to manipulating text strings. Today we’re going to learn about functions from the packages readr, dplyr, tibble and tidyr. Because these packages are very tightly integrated, the authors decided to create a further package called the tidyverse which enables the loading of the eight core packages (ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr and forcats) in one command. An entire programming culture has sprung up around these packages, which is often referred to as the tidyverse or tidy data, or something similar. In essence, the release of these packages and their ease of use, opened R up to a far wider community. All of these packages are maintained by staff from RStudio, in particular Hadley Wickham, who is one of the most influential figures within the international R community.


Hi! I’m Hadley Wickham & I wrote/designed most of the tidyverse. I’m originally from New Zealand, so my functions use English spelling instead of US spelling. Yay! (source: https://github.com/hadley)

After your setup chunk, create a new chunk called packages and enter the following code.

library(tidyverse)

When executed, this will load all eight of the packages described earlier. Compile your document and inspect the output.

You’ll probably notice that this chunk has been rather “chatty” and provided an unexpected amount of output in your final document. The first time you execute this chunk in an interactive session you would have seen the same output appear in your Console (or in your R Markdown if you have it set for that). This is what’s known as a message in R, and is just communicating information to us. It may look like some kind of error, but this one isn’t.

The first few lines inform us which packages have been loaded, and which versions they are. Below the ── Conflicts ── heading should be two lines letting us know that the function filter() from dplyr is now the first one R will look for in the session, taking precedence over the function filter() from stats. The same applies to the function lag(). This really isn’t super relevant for us, so we can ignore all of this.

One convenient piece of syntax shown here is the use of namespaces. As we can see, two packages have functions called filter() that are currently available to use. If we wish to specify exactly which one to use, we can put the package name before the function name, separated by a double colon (::). Formally, this is known as a namespace, which is essentially the list of functions made visible by loading a package. If we want to be sure that we’re using filter() from dplyr, we can write dplyr::filter() instead of just writing filter() and hoping that R figures it out. We won’t come across too many instances of this today, but the more packages we learn about, the more useful a trick it becomes.
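
As a quick illustration, both versions of filter() exist once the tidyverse is loaded, and the :: syntax lets us be explicit about which one we mean. Entering these lines in your Console will simply print each function, confirming where each one lives.

stats::filter  # the time-series filter from base R’s stats package
dplyr::filter  # the row-subsetting filter from dplyr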

Given that this output is going to make our compiled document look a bit uglier, we can hide this type of output by setting the chunk argument message = FALSE. We can do this in the chunk header for this chunk only, or we can add it to our setup chunk, inside the call to knitr::opts_chunk$set() after echo = TRUE. We’ll have to separate these with a comma, but this will then become a global setting so that all messages from any chunks are hidden. This is my preferred option for all of my R Markdown documents. (Notice the use of the namespace trick here to ensure that opts_chunk$set() from the package knitr is called.)
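
If you take the global approach, your setup chunk might now contain a line like the following. (A minimal sketch; any other arguments you already have can simply stay in place.)

knitr::opts_chunk$set(echo = TRUE, message = FALSE)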

Importing Data

Now that we’ve started our R Markdown document and tidied up our output, let’s get on with importing data. First, we’ll use the convenient GUI that R Studio provides, but then we’ll ditch this and just use code in the style that we’ve already seen.

Using the GUI

The file we’ll work with today is called Enrichment.csv and contains some actual results from a transcriptomic analysis comparing mutant and wild-type zebrafish. This data was generously provided by Prof. Michael Lardelli who does some excellent work in the context of Alzheimer’s Disease. As this is a csv file, we know this will be a plain-text file, with columns separated by the comma symbol.

Using the Files pane, navigate into the data folder and click on the file Enrichment.csv. When prompted, select View File and the file will open in your script window, allowing us to have a sneak peek at it. Note that we haven’t loaded it into the R Environment. We’re just opening it like we open any plain-text file to have a look at the contents. Once you’re happy with what it looks like as plain-text, close it by clicking the cross next to the file name and return to your Files pane.

Click on the file again, but this time select Import Dataset. A GUI will now open and you should see an image like the following:

If the Data Preview section is blank, click the Update button and it will refresh. (This seems a bit glitchy sometimes.) Don’t click Import just yet.

Notice that in the main Data Preview section, R Studio has figured out:

  1. We have comma-separated values. The suffix .csv is how it guessed this.
  2. The first row contains column names
  3. What type of values are in each column. You can see the data type in brackets underneath the column names.

A vitally important feature of the GUI is in the bottom right, where we are given a preview of the actual code that will be executed to import the data into R.

Here, there are three lines which we need to understand

  1. library(readr)
  2. Enrichment <- read_csv("data/Enrichment.csv")
  3. View(Enrichment)

Using your mouse, select and copy these three lines, then click the Import button and some crazy magic will happen. You’ll probably find yourself looking at an Excel-like preview, so before we move on, go to DataImport.Rmd and create a new chunk. Name the chunk import, then paste these lines of code into the chunk.

Let’s see what we’ve actually done

library(readr)

The first line is telling R to load the package readr. This package contains the function read_csv() which is what parses the plain-text file into the R Environment. We’ve already loaded this package in our first chunk as part of the tidyverse, so we can actually delete this line from our import chunk. (Please do so.)

Enrichment <- read_csv("data/Enrichment.csv")

The second line is where the importing actually happens. Hopefully this looks like normal R code to you now: we have an object name (Enrichment) followed by the assignment operator (<-). The data we’re placing into this object is the output of the function read_csv(). Notice that we have simply given the file path to this function, and it has done the rest.

The GUI has automatically decided on the object name Enrichment as this was taken from the file name by dropping the .csv suffix.

View(Enrichment)

The third line has opened up our Excel-like preview. For those who really like this format, this can provide a familiar way to inspect your data. However, for those who are R purists, this actually looks horrifying! Once you’ve had a quick browse, close the preview and wash your eyes out after seeing such dirty things.

We’re also going to delete this third line from our import chunk. This preview is designed for interactive inspection of the data, and when we’re compiling our R Markdown document we simply can’t interact with the data like this.

An alternative way of viewing the data object is to simply enter the name of the object in the Console. Do this, and notice all of the information that has been presented to you here.

  1. # A tibble: 500 x 9: This is a type of object known as a tibble, which is based on a data.frame, but wrapped in lovely wrapping paper so it looks nice. Notice that this output also tells us how many rows (500) and columns (9) that we have.
  2. After that we have all of the column names, with the data type shown in grey underneath each column name.
  3. Then we are shown a preview of the first 10 rows of data, followed by # … with 490 more rows. This is a great feature of tibbles, as a data.frame would dump all 500 rows onto the screen. Each column has also been truncated in the display so you get a peek at what every column looks like, without seeing all of the data. For this dataset, some of the columns are fairly wide so this also makes our lives easier.

These features are why many R users prefer this preview, as you can see all of that important summary information as well as having a sneak peek at the contents. Using the Excel-like View() command is, in some people’s opinions, actually less informative.

Another way to preview data

One additional way to preview your data is to use the function glimpse(), which offers similar information to simply typing the name of the tibble, but may yield a few more insights into the first few values. As you can see, this view doesn’t restrict the width of columns like the basic print method for a tibble does.

glimpse(Enrichment)

Changing the Defaults

All of the settings for read_csv() were guessed by the R Studio GUI and for the most part, these will be correct. Can we change these using the GUI?

Re-open the GUI by clicking on the file, but once again do not click Import.

  1. Try changing the name to enrichment at the bottom left, then click anywhere else in the GUI. You’ll notice that the Code Preview has updated to reflect this.
  2. Uncheck the First Row as Names check-box. Clearly, this would be a bad idea for this dataset. However, notice that the column names have now changed, the types of data in each column have changed, and the code has changed to include the argument col_names = FALSE. Re-check the check-box to make sure we don’t accidentally do anything silly.
  3. Uncheck the Open Data Viewer check-box. What changed?
  4. Change the Delimiter: drop-down menu to anything other than a comma. Although this would also be a bad idea for this dataset, notice that the code changes along with the preview. This may be useful if you have tab-separated data, or if you have any other delimiter.
  5. Click the Cancel button to make sure we don’t import anything silly into our R Environment.

In our last change above, the import function changed to read_delim() instead of read_csv(). read_delim() is actually the main function, whilst read_csv() is a wrapper with all arguments preset to values that work for a csv. Additional functions like read_tsv() do the same thing for tab-separated files. We’ll explore read_tsv() a little later in our session.
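
To make the relationship concrete, the following two calls should import identical objects. (Just a sketch for comparison; we’ll tidy up our real import code next.)

read_csv("data/Enrichment.csv")
read_delim("data/Enrichment.csv", delim = ",")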

Using Code to Import Data

Returning to our R Markdown: as you may have figured out by now, we’ve kept only the line from our code that performs the actual data import.

Enrichment <- read_csv("data/Enrichment.csv")

Have a look at your Environment Tab and you’ll see Enrichment sitting there. Now that we’ve saved our code for the import, click the broom symbol next to Import Dataset. This will delete the object from your R Environment.

Now execute the chunk that has the import code, and it will reappear in your Environment. This is your first taste of reproducible research. We no longer have to click on everything, but can write the code to perform an operation.

After your import chunk, include the following lines.

The object Enrichment was imported, which contained `r nrow(Enrichment)` gene sets.

Now compile your document and check the document we have generated. It’ll be pretty simple, but we’ve effectively automated data import, and printed a quick summary of some important information.

What is in this object

In this object, we have 9 columns (gs_name, gs_size, nDE, wilkinson_p, FDR, fry, camera, gsea and goseq), each of which contains different pieces of information.

  1. gs_name: This is the name of the gene set that we have tested for being enriched in our results
  2. gs_size: This is the number of genes in the complete gene set
  3. nDE: This is the number of genes in the gene set that were formally considered as differentially expressed (DE) between the two genotypes
  4. wilkinson_p: This is our p-value for enrichment after combining multiple tests
  5. FDR: This is an adjusted p-value which tells us what proportion of our results are likely to be false if we choose the corresponding value from wilkinson_p as our threshold of significance
  6. fry, camera, gsea and goseq are four different analyses we performed on our data. We then combined these p-values (using Wilkinson’s method) to obtain the value wilkinson_p. (For reference, goseq needs a list of formally declared DE genes, whilst the other analyses work on ranked lists.)

We’re now going to explore a whole series of functions in the package dplyr which we can use to extract key results from our enrichment analysis.

The Package dplyr

Just like readr contains a series of functions designed to read data into R, the package dplyr contains a huge set of functions for working with tibble objects.

What is a tibble

The main object type we will work with today is a tibble, which is an extension of the basic R structure called a data.frame. In short, these are very much like spreadsheets with clear rows and columns, but there are a few key points we should just make a note of:

  1. All values in a column must be of the same type (i.e. character, numeric, integer, logical), however, each column can be of a different type to the other columns
  2. All columns must be the same length
  3. Column names must be specified. If not, R will add generic column names like X1, X2 etc
  4. Row names are optional.

This final point is one of the things that sets a tibble apart from a data.frame. Whilst a data.frame can have row names, a tibble cannot, and will silently drop them by setting them to the row numbers. This is a design feature enforced by Hadley Wickham, and can initially seem a little strange (it is much argued about amongst the R community). However, it can provide a little more flexibility when dealing with non-trivial data structures. Other features which make tibble objects a slightly more convenient version of a data.frame are:

  1. When you type the object name, only the first few rows are displayed. A data.frame will dump the entire contents into your R Console if you just type its name
  2. Only the first few columns are displayed, with a summary of the remaining columns given. Once again, a data.frame will dump its entire contents into your R Console
  3. The type of value contained in each column is shown
  4. The dimensions of the object are shown

Beyond this, they are essentially identical structures and a tibble can just be thought of as a data.frame with pretty wrapping paper. Sometimes, you will hear tibble objects referred to as data.frame objects as this is their basic class, so hopefully this won’t cause any confusion.
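
Although today we’ll create our tibble by importing a file, you can also build one by hand using the function tibble(), which makes the rules above concrete. (A purely illustrative sketch; the column names and values here are made up.)

tibble(
  gene = c("cd40", "cd59", "stat5"), # a character column
  counts = c(25L, 3L, 10L)           # an integer column of the same length
)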

Subsetting a tibble/data.frame

Regardless of whether you have a data.frame or a tibble, we can subset the object using square brackets after the object name, with rows and columns specified as object[row, column]. We could print the entire first row of Enrichment using Enrichment[1, ], or we could print the first 3 columns using Enrichment[, 1:3]. Combining the two to print the first row of the first 3 columns would be Enrichment[1, 1:3].
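
Collecting those three examples as code we can execute:

Enrichment[1, ]    # the entire first row
Enrichment[, 1:3]  # the first 3 columns
Enrichment[1, 1:3] # the first row of the first 3 columns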

We can also call columns or rows by name, although given that tibbles have no rownames, names cannot be used to print specific rows from tibble objects. Note that when doing this for column names below, we use the c() function to combine the names we want as a character vector, enclosing each column name in quotation marks.

Enrichment[1:5, c("gs_name", "nDE")]

In general, we encourage using column names, rather than the numeric positions. This can be more robust if you accidentally change the column order, and is also far easier to understand when other people (e.g. you in the future when you’ve forgotten what you did in the past) read your code.

Another common way to extract information from a single column is to use the $ symbol, and the square brackets. In this case, our column is treated as a vector and we only need to specify the position in the brackets, without a comma. In the following, we’ll just grab the first two entries in the gs_name column.

Enrichment$gs_name[1:2]

Although we haven’t formally mentioned it, using square brackets like the chunk above is also how you can subset a vector, like the hi vector we used last time. We will more formally discuss vectors in a later practical.

Introducing dplyr

The above methods are the traditional ways of accessing data within a tibble or data.frame. Another package written by Hadley Wickham is dplyr which introduces a variety of additional strategies for manipulating these objects, and is loaded as part of the tidyverse. These functions not only make sense to an Excel user, but enable extremely powerful approaches for data manipulation.

Using select() to select columns

Whilst we have already seen how to select columns using conventional strategies, dplyr introduces a new function select() which enables you to select columns. (Most functions in dplyr have intuitive names.) The strategy here is to pass our tibble/data.frame as the first argument to the function, and then to just list any columns we wish returned. Column names can be quoted, or given that they are columns within the data frame we’re working with, they can also be unquoted. We also no longer need to use the c() function to combine our column names.

select(Enrichment, gs_name, gs_size)

The above example of select() may also seem pretty simple, but given the helper functions which dplyr includes, this approach becomes extremely powerful. These common helper functions are: contains(), starts_with(), ends_with(), one_of() and everything().

An example of this may be just to extract the gs_name and gs_size columns.

select(Enrichment, starts_with("gs_"))

What would’ve happened if we’d just used starts_with("gs")?

Although everything() may initially seem pointless, we can also use it within select() to simply rearrange our columns.

select(Enrichment, contains("DE"), everything())

If we wish to remove columns we just place the - symbol before the column names, or helper function.

select(Enrichment, -contains("DE"))

Notice that through the above calls to select(), we never overwrote our original object, because we never used the <- symbol to assign anything to a new object. select() simply takes a tibble/data.frame as input and returns the same. Although we will create new objects later today, this concept of returning an object of the same type as the input is an important one.
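
If we did want to keep one of these results, we would simply assign it to a new object using <-, for example as follows. (The name gs_columns is just an illustrative choice.)

gs_columns <- select(Enrichment, starts_with("gs_"))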

Using arrange() to sort columns

A common strategy in Excel is to sort your data by the values in a particular column. The function that dplyr introduces for this is arrange(), and it behaves very similarly to select(): we pass it the tibble object, name the columns we wish to sort by, and a tibble is returned. As before, the original object is not overwritten, and if a tibble is provided as input, a tibble will be returned as output.

By default, we have given you the object sorted by the wilkinson_p column. If we wish to sort by gene set size, we would sort on the gs_size column.

arrange(Enrichment, gs_size)

This will sort in increasing order by default, but to switch this to decreasing (or descending) order, we can wrap the column name in the desc() function.

arrange(Enrichment, desc(gs_size))

Now the largest gene sets are the first ones shown.

Using filter() to subset our results

Yet another common strategy in Excel is to use the auto-filter to restrict the values shown to those matching given criteria. We can do this using the filter() function, which behaves similarly to the previous functions we’ve seen, but this time we need to perform a logical test on our column(s) of interest. Let’s just get the gene sets which contain at least one DE gene.

filter(Enrichment, nDE > 0)

We can also combine filtering criteria for more complex strategies. Here we can ask for the significant gene sets (FDR < 0.05) which contain at least one DE gene.

filter(Enrichment, nDE > 0, FDR < 0.05)

Using slice() to obtain rows

Whilst filter() can be very useful for returning rows that match given criteria, sometimes we do want to call rows by position. A classic example may be if we just want to show the top few rows of a given list, and to do this, we can use the function slice().

slice(Enrichment, 1:5)

Importantly, we use filter() to return rows matching given criteria, but we use slice() to return rows based on their position in the original object.

Using mutate() to add columns

The final thing we might like to do with our tibble object is to add a column. To add columns using dplyr, we can use the function mutate().

In the following code, we’re adding a new column called prop_DE by dividing the number of DE genes by the total gene set size.

mutate(Enrichment, prop_DE = nDE / gs_size)

If you can’t see the contents of this column, temporarily drag the R Studio panel divider so that the Console is wide enough to show all columns. Again, we haven’t yet overwritten our original object. We’ve just passed it to the function mutate() which has returned another tibble, but this time with one additional column.

The column gs_name contains gene sets from three sources. These are the GO database, the KEGG database, and a list of gene sets from the MSigDB database known as the HALLMARK gene sets.

We can use some of the functions we learned when we were exploring stringr to obtain this information.

mutate(
  Enrichment,
  source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")
)

Using rename()

One final function we’ll introduce in this section is rename(), which can be used to rename columns. (Surprise!) For this function, we provide the new name for each column we wish to rename, in the form new_name = old_name.

rename(Enrichment, n_de = nDE)

Combining Functions

In the above sections, we have seen many of the core functions in dplyr; however, this is really just scratching the surface of what we can do. Often we would like to combine some of these functions to produce some output. For example, we might like to arrange() our data by some key value, then just show (slice()) the top few values of our results table.

With our current knowledge, we would have two strategies available to us.

  1. We save the output of the first function as a new object, then pass this as the input to the second function, or
  2. Wrap the functions inside each other

If using strategy 1, we would end up with a workspace filled with similar objects, which would rapidly become a confusing mess. Strategy 2 would work, but it can look ugly and be hard for your collaborators to understand.

slice(
  arrange(Enrichment, desc(nDE)),
  1:5
)

Even doing this, it’s pretty clear that we’d probably want to remove some of the columns from our output table. We could do this, but even though I’ve tried to make the next chunk look readable, it’s still starting to get messy. And that’s after performing just two trivial operations!

slice(
  arrange(Enrichment, desc(nDE)),
  1:5
)[,c("gs_name", "gs_size", "nDE", "wilkinson_p", "FDR")]

Introducing the magrittr

If you are familiar with bash, you may have come across the pipe as a way of chaining multiple functions together. If you’re not familiar with bash, you’re still going to learn a useful new trick.

R has a special tool called the magrittr (often just called the pipe), which is the symbol %>%. What it does is take whatever is given on its left and place it as the first argument of the function on its right. It is very much like attaching a pipe to the output of one function and directing that output to become the input of the next, hence the name. (If you’d like to understand the ‘hilarious’ name, look here.) For example, the following two lines of code are functionally identical!

arrange(Enrichment, desc(nDE))
Enrichment %>% arrange(desc(nDE))

In the second line, we have taken the output of Enrichment (which is a tibble) and passed it to the function arrange() as the first argument. As you may have figured out, we could also take the output of arrange() (which is also a tibble) and pass it to another function, such as slice().

Enrichment %>% 
  arrange(desc(nDE)) %>%
  slice(1:5)

Notice how this is much easier to read than our previous attempt! As we join several functions together, it’s also very good practice to place each function on its own line. This greatly assists readability, particularly as you will often end up with chains of 10-20 operations. R Studio will also indent each line for you automatically so you can easily see where the chain starts and finishes. (As you learn, please note the indentation styles we use in these practicals. They are explicitly included to teach you good practice and make your code readable.)

In our earlier attempt, we also grabbed just the columns we wanted, and we can now add an extra call to select() to manage that.

Enrichment %>% 
  arrange(desc(nDE)) %>%
  slice(1:5) %>%
  select(starts_with("gs_"), nDE, wilkinson_p, FDR)

Task

Try and write a code chunk that performs the following steps. (Have a genuine go yourself first; one possible solution is sketched after the list.)

  1. Restrict the results to significant gene sets, using an FDR below 0.05 as your criteria
  2. Restrict this list to gene sets with at least 1 DE gene
  3. Find the proportion of genes in the gene set which are DE
  4. Sort them in descending order, based on the proportion of DE genes
  5. Provide the final output as the gene set name, the gene set size, the proportion of DE genes and the FDR
  6. Rename the columns nDE and FDR to be n_de and fdr respectively.
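
One possible solution is sketched below; only peek once you’ve made your own attempt. (The column name prop_DE is just a suggestion, and keeping nDE in the output lets us rename it in the final step.)

Enrichment %>%
  dplyr::filter(FDR < 0.05) %>%       # 1. significant gene sets only
  dplyr::filter(nDE > 0) %>%          # 2. at least 1 DE gene
  mutate(prop_DE = nDE / gs_size) %>% # 3. proportion of DE genes
  arrange(desc(prop_DE)) %>%          # 4. sort in descending order
  select(gs_name, gs_size, nDE, prop_DE, FDR) %>% # 5. the final columns
  rename(n_de = nDE, fdr = FDR)       # 6. rename nDE & FDR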

Grouping values

Given that we have data from three sources, we might wish to obtain some summaries for each of these. dplyr provides the function group_by() to allow this. First let’s obtain the information describing the source of the gene set, as we did above, but then we should have a look to make sure we have everything right.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  select(gs_name, source) %>%
  print(n = 20)

Let’s group by this new column and see what happens

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  group_by(source)

As you may have noticed, nothing appears to have happened, except for one thing. If you check the top of the output, you’ll see # Groups: source [3], which tells us that R is now keeping track of the 3 groups in the column source. We can pass this to the function tally(), which will count how many rows belong to each group.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  group_by(source) %>%
  tally()

Perhaps we may even wish to find how many gene sets were considered significant using some predefined cutoff. Let’s try using an FDR-adjusted p-value < 0.05 to define significance.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  dplyr::filter(FDR < 0.05) %>%
  group_by(source) %>%
  tally()

Using summarise()

We might also like to find the lowest FDR-adjusted p-value for each group.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  group_by(source) %>%
  summarise(lowest = min(FDR))

We could even find the distribution of gene set sizes for each source

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  group_by(source) %>%
  summarise(
    smallest = min(gs_size),
    mean = mean(gs_size),
    largest = max(gs_size)
  )

Using summarise_at()

Once we start wanting to apply multiple functions to one or more columns, we can use the function summarise_at(). We can use our unquoted column names or helper functions to select the columns, but they do need to be wrapped in another function vars(). This seems a bit clunky, but it does the job. Also notice that when we specify the functions, we provide names that become column names.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  group_by(source) %>%
  summarise_at(
    vars(gs_size), 
    c(smallest = min, mean = mean, largest = max)
  )

Although we won’t explore them today, there is a series of functions ending in _at, such as rename_at(), mutate_at() and filter_at(), which can be very powerful, but as the syntax can get clunky, we’ll save these until later. There is also a whole series of related functions ending in _all and _if which can perform complex operations across all of your data, or just key columns. Once again, we need to know a little more than we do now to really exploit these.

Joining Data From Multiple Sources

So far in this session, we’ve just explored some of the functionality of dplyr when working with one object. Combining data from multiple sources is a very common requirement regardless of your context, and once again, dplyr provides some tools to help with this. In our existing results, we have our results for each gene set, but we don’t actually know which genes are DE. We know how many, but not which ones.

Let’s load in a separate file with this information. In your main data folder on your VM (where we place all of your files), locate the file de.tsv and copy it to the same folder as Enrichment.csv. Note that this time we have a tab-separated file, instead of a comma-separated file. If you click on this file and hope that R Studio is going to give you options (like it does with a csv), you’re out of luck. For some reason, loading of tsv files via the GUI is not implemented yet, so we’re going to have to write some code ourselves. Fortunately, it’s pretty easy.

Before we assign the imported data to an object, let’s get our importing code figured out. The function we’ll need here is read_tsv() instead of read_csv(), so let’s start with that.

```{r de}
read_tsv()
``` 

The first argument required by read_tsv() is called file, so we need to tell read_tsv() where our file is.

File Paths

This is as good a moment as any to introduce file paths. When we compile an R Markdown document into its final form (i.e. pdf, html etc.), all file paths are relative to the directory that the R Markdown document is in. We should have our R Markdown file in the project root directory, so it’ll be pretty easy today, but this is not always the case. To find where R is currently looking, known as your working directory, enter the command getwd() directly into your Console. If all goes according to plan, this will print the complete path to your working directory, and this should begin with /home/student/, which you may remember is often shortened to ~ as your home directory. (Every other user will have a home directory in /home too.)

On Linux and macOS, the / symbol at the start of a file path is the root of your file system. Windows users more commonly think of this as C:\, and this represents a fundamental difference between the two architectures. So in this segment of the file path, the file system root is /, with the first-level directory being home and the next level being student. An example directory tree might look like the following:

/
|-- bin
|-- etc
|-- home
|   |-- student
|   |-- ubuntu
|-- lib
|-- opt
|-- usr

Returning to our working directory, after /home/student/, you should see transcriptomics/practical_2 and this is the complete path to your working directory.

We can look in our current directory using list.files(), so hopefully that will give you your markdown document(s) and the data folder. If we wanted to look up a directory, we could try list.files("..") and that will step back into /home/student/transcriptomics and you should see the directories practical_1 and practical_2 there. We can also look into our data folder using list.files("data"), and that should give us the two files Enrichment.csv and de.tsv.
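
Putting those commands together (the exact output will, of course, depend on your own machine):

getwd()            # the complete path to your working directory
list.files()       # your markdown document(s) and the data folder
list.files("..")   # one level up: practical_1 and practical_2
list.files("data") # Enrichment.csv and de.tsv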

This is how we can navigate around from our working directory, and if your R Markdown file is compiled from your project root directory, you’re likely never to have any dramas. However, when we collaborate with others it can get a bit more tricky. It’s very possible that instead of "data/Enrichment.csv" we provide the full path "/home/student/transcriptomics/practical_2/data/Enrichment.csv", and if you then sent that to a tutor who is going to mark your assignment, that folder probably won’t exist on their machine. If your tutor is linda, her path may be "/home/linda/transcriptomics/practical_2/data/Enrichment.csv", or it may be completely different.

There is a package called here on your VMs which helps us get around this. Within the package here is a function of the same name (here()), which looks for the .Rproj file and considers that directory to be the project root. This enables portability: by using here::here() on your system you’ll get "/home/student/transcriptomics/practical_2/", whilst Linda would get "/home/linda/transcriptomics/practical_2/", or the complete path to wherever she has her practical_2.Rproj file.

If we then provide a further directory to here::here(), we can look down into sub-directories with ease. For example, here::here("data") would provide the complete path to the data directory for practical 2 in a machine-agnostic fashion, as long as you are working within an R Project called practical_2. This is one of the reasons that R Projects are helpful, by doing virtually nothing but existing.

The full path to de.tsv can be specified as here::here("data", "de.tsv") or here::here("data/de.tsv"). The first of these is the most powerful, as it will insert the correct directory separator: / for Linux/macOS, or \ for Windows. However, R installations on Windows are now able to correctly interpret file paths which use the Linux/macOS directory separator. This hasn’t always been the case.
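
For example, entering the first form in your Console should return something like the path below. (The exact value depends on where your practical_2.Rproj file lives.)

here::here("data", "de.tsv")
## [1] "/home/student/transcriptomics/practical_2/data/de.tsv"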

If you chose the second option, you may have also noticed that RStudio helps you complete the file path. If your working directory whilst you’re entering your code is the project root, this works very well. If your working directory whilst you’re entering your code is NOT the project root, hopefully you now have the skills to figure it out.

Importing de.tsv

We can specify our path to de.tsv using either of these methods. Our import command should now look like this

```{r de}
de <- read_tsv(here::here("data/de.tsv"))
``` 

Alternatively, you could write

```{r de}
de <- here::here("data/de.tsv") %>% 
    read_tsv()
``` 

As we now know, these are equivalent expressions and you may find one or the other easier to read back in the future. Using read_tsv() for this dataset is dead easy, because we didn’t want to confuse you any further just yet.

If we look at our de object, you’ll notice that we have 4007 gene sets, with the gene set name in the first column. The list of DE genes associated with each gene set is in the second column, separated by a semicolon (;). Clearly we have far more gene sets here than in our Enrichment object, so how do we put the right genes in the right place?

Using the *_join() Family of Commands

left_join()

The most common strategy is to have one object (e.g. Enrichment) to which we’d like to add further information. Here we’d like to add the names of the DE genes from de to the Enrichment object, but we have data structures that clearly don’t have matching dimensions. This is a classic use-case for the function left_join(), which treats the first of two objects as the scaffold and places the contents of the second into the first, joining on any common columns. Our common column here is gs_name, and left_join() will automatically detect this as its first step.

Enrichment %>%
  left_join(de)

Notice that left_join() has figured out where everything goes in a few milliseconds. However, there are some missing values. If we check the nDE column, there are multiple gene sets where no genes were formally considered DE, so these gene sets will be missing from the de object.

Just so you believe us, try looking for the gene set "GO_COPI_VESICLE_COAT" which should be the second row of Enrichment.

filter(Enrichment, str_detect(gs_name, "GO_COPI_VESICLE_COAT"))
filter(de, str_detect(gs_name, "GO_COPI_VESICLE_COAT"))

right_join()

Where data points are missing from the second object (i.e. de), left_join() will add NA (for not available) in the additional columns being added to the first object. The reverse of this would be right_join() where the second object is treated as the scaffold.

Enrichment %>% 
  right_join(de)

At this point, the results are identical to de %>% left_join(Enrichment), however in the real world we may use a join part of the way through a series of commands piped together using the magrittr (%>%), so having both available is very useful.

full_join()

Two final functions in the *_join() family which we’ll look at are full_join() and inner_join(). full_join() will join both objects together, but where data is missing in one, NA values are placed in the corresponding column.

Enrichment %>%
  full_join(de) 

In general, the structure of the first object will be used as a scaffold, with the additional rows added at the end.

Enrichment %>%
  full_join(de) %>%
  tail()

inner_join()

By contrast the function inner_join() will only return rows which are complete. Any rows which would contain a missing value are simply dropped.

Enrichment %>%
  inner_join(de)

Task

The transcription factor stat5 may be playing a key role in any inflammation in this experiment. Find all significantly enriched gene sets containing this gene.

Alternatively, we think the surface markers cd40 and cd59 might be of interest. Do they belong to any significantly enriched gene sets?

(Try to perform this by searching for both of these within the same call to filter().)
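
If you get stuck, one possible approach is sketched below, remembering that the DE genes for each gene set live in the DE column of de, and that regular expressions allow us to search for cd40 or cd59 with a single pattern.

Enrichment %>%
  inner_join(de) %>%
  dplyr::filter(FDR < 0.05, str_detect(DE, "stat5"))

Enrichment %>%
  inner_join(de) %>%
  dplyr::filter(FDR < 0.05, str_detect(DE, "(cd40|cd59)"))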

bind_rows()

Two additional commands are bind_cols() and bind_rows(). Usage of the first is strongly discouraged, as the rows of the two objects are not checked to ensure they actually correspond; the columns are simply placed side-by-side in whatever order the rows appear. There is usually a better way of doing things if you’re tempted to use this.

In contrast, bind_rows() is incredibly useful, as it checks for common column names. A foolish thing to do might be to use bind_rows() to join Enrichment and de; however, it makes a useful technical example for demonstrating the bind_rows() command in action.

The common column will be gs_name, so the tibbles will be stacked on top of each other using this column. No DE column exists in Enrichment, so NA will be used to fill this column for the rows originally in Enrichment. Similarly, the columns gs_size, nDE, wilkinson_p, FDR, fry, camera, gsea and goseq don’t exist in de, so these columns will be given NA for any rows originating from de. You’ll see, however, that the total size of the joined tibble is 4507 rows, which is exactly what you’d expect from simply placing one object after the other (500 + 4007).

Enrichment %>% bind_rows(de)

An alternative, but far more realistic example of this may be if we returned to some of our group_by() operations. Earlier, we found how many gene sets from the GO, KEGG and HALLMARK databases were considered as enriched.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  dplyr::filter(FDR < 0.05) %>%
  group_by(source) %>%
  tally()

Something you may find useful is to add a summary row containing the total.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  dplyr::filter(FDR < 0.05) %>%
  group_by(source) %>%
  tally() %>%
  bind_rows(
    tibble(
      source = "**Total**",
      n = 15
    )
  )

Here we’re trying to create a new tibble to join onto the first one, and it should have the columns source and n so that the two tibbles match. We’ve given the first column the value “Total” and highlighted it as bold for R Markdown (we’ll render this in html soon). In the second column, we manually entered the value 15, because we’d looked at the first result and figured that 15 = 14 + 1.

We can actually grab this total from the data using a slightly advanced trick, where we can refer to the object passed by the magrittr using the . placeholder. As a simple example of this technique, we can use a subsetting operation ([row, column]). The following is a slightly less elegant way to replicate Enrichment %>% slice(1:5).

Enrichment %>%
  .[1:5, ]

Here we’ve used . to indicate the object passed by %>% and we’ve then performed the subsetting operation on this. If we apply the same logic to our actual problem at hand, we can use:

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  dplyr::filter(FDR < 0.05) %>%
  group_by(source) %>%
  tally() %>%
  bind_rows(
    tibble(
      source = "**Total**",
      n = sum(.$n)
    )
  )

This time, we’ve used the $ separator to grab the n column from ., and then we’ve added the values using sum(). This is about as mind-blowing as we’ll get this week, so hopefully this hasn’t caused too many explosions in your brain. It’s a very useful trick though.

Making Pretty Output Using pander

So far, every piece of output we’ve generated has been printed as a standard tibble, which renders in your compiled document as plain R output, with ## at the start of each line. If we’re compiling to html, there is a very handy package called pander which will take a data.frame or tibble and produce a markdown table. At the top of your R Markdown, please add the line

library(pander)

This will load pander when you compile, so execute the chunk to ensure it’s active in your interactive session too. We can now take the output of our previous chunk and pipe it into the function pander().

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  dplyr::filter(FDR < 0.05) %>%
  group_by(source) %>%
  tally() %>%
  bind_rows(
    tibble(
      source = "**Total**",
      n = sum(.$n)
    )
  ) %>%
  pander()

Notice that now we have a table in pure markdown, which will be recognised as a table when we compile to html. Try compiling to see how this renders.

We can simply modify this output to add a caption, or to align each column. Column alignments are set using the argument justify, with "l" indicating left alignment and "r" indicating right. (Predictably, centre alignment is "c".)

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  dplyr::filter(FDR < 0.05) %>%
  group_by(source) %>%
  tally() %>%
  bind_rows(
    tibble(
      source = "**Total**",
      n = sum(.$n)
    )
  ) %>%
  pander(
    justify = "lr"
  )

To add a caption, we provide this to pander() using the argument caption.

Enrichment %>%
  mutate(source = str_extract(gs_name, "(GO|KEGG|HALLMARK)")) %>%
  dplyr::filter(FDR < 0.05) %>%
  group_by(source) %>%
  tally() %>%
  bind_rows(
    tibble(
      source = "**Total**",
      n = sum(.$n)
    )
  ) %>%
  pander(
    justify = "lr",
    caption = "Summary of enriched gene sets, and which database they were drawn from."
  )

There are multiple packages which can perform this kind of task (xtable for \(\LaTeX\), or kableExtra for some alternative html formatting options). In general, pander provides a reasonably powerful and straightforward way of formatting html output, so hopefully this gives you enough options to work with for now. However, it may be worth noting that pander will print the entire tibble you pass to it, so making sure you’ve used filter() or slice() to return just the rows you care about can be very important.
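
For example, we might render only the five most significant gene sets. (A sketch, simply reusing functions from earlier in the session.)

Enrichment %>%
  arrange(wilkinson_p) %>%
  slice(1:5) %>%
  select(gs_name, gs_size, nDE, FDR) %>%
  pander(caption = "The five most significant gene sets.")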

A final word on dplyr

The functions in dplyr are very useful and form the backbone of nearly every analysis, transcriptomic or otherwise. However, their names are very commonly used by other packages (e.g. filter(), select(), mutate(), arrange(), slice()). Sometimes you may see an unintelligible error such as

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘select’ for signature ‘"spec_tbl_df"’

Or

Error in .Call2("Rle_constructor", values, lengths, PACKAGE = "S4Vectors") : 
  Rle of type 'list' is not supported

This can be your first clue that you have two packages loaded which both use the same function names. If you call a function from dplyr and you see a very odd error instead of your expected results, you can explicitly call the dplyr version of the function using the namespace from the package. This would look like dplyr::select(), or dplyr::filter() etc. Even though dplyr is loaded with a group of packages when we loaded the tidyverse, knowing that these functions come from dplyr is important, as that is the ‘namespace’ we need.

If you’re troubled by the word namespace, this is simply a collection of function names that are made publicly visible when you load a package. R has a way of holding all of these in memory separately for each package after loading, so that you can refer back to them easily in these kinds of situations.

We quietly introduced this concept earlier when we used here::here(), and you’ve also seen this before in your setup chunk using knitr::opts_chunk$set(), so hopefully this isn’t too weird. We’re mentioning it here, as many packages used in transcriptomic analysis will clash with dplyr exactly as we’ve described.