Spring Into Bioinformatics

Introduction
Incorporating Data Manipulation With Plots

Introduction

Now we have learned how to make a few plots with ggplot2, to really get things moving we should learn how manipulate tibble objects to add extra columns, sort columns, perform filtering steps etc., just like we do with Excel.

Tibbles and Data Frames

Most of the objects we work with today are tibble objects, which are an extension of the basic R structure called a data.frame. In short, these are very much like spreadsheets with clear rows and columns, but there are a few key points we should just make a note of:

All values in a column must be of the same type (i.e. character, numeric, integer, logical)
Each column can be of a different type to the other columns
All columns must be the same length
Column names must be specified. If not, R will add generic column names
Row names are optional.

This final point is one of the things that sets a tibble apart from a data.frame. Whilst a data.frame can have row names, a tibble cannot, and will drop them. This can initially seem a little strange, but it does provide a little more flexibility when dealing with non-trivial data structures. Other features which make tibble objects a slightly more convenient version of a data.frame are:

Only the first few rows are displayed when you type the object name. A data.frame will dump the entire contents into your R Console if you just type its name
Only the first few columns are displayed, with a summary of the remaining columns given. Once again, a data.frame will dump its entire contents into your R Console
The type of value contained in each column is shown
The dimensions of the object are shown

Beyond this, they are essentially identical structures and a tibble can just be thought of as a data.frame with nice wrapping paper around it. Sometimes, you will hear tibble objects referred to as data.frame objects as this is their basic class, so hopefully this won’t cause any confusion.

Even though we have restricted our results to the genes on chromosome 1, we still have over 1000 genes in our dataset. If we wish to look as specific values, we can use the square brackets ([r,c]) just to pull out the rows and columns of interest. Rows go before the comma, whilst columns go after the comma. In the following, we’ll just grab the first row, and the first two columns.

DGE_Results[1, 1:2]

We can also call columns by name, although given the lack of row names for tibbles, we can only call rows by position (this only applies to tibble objects). Note that when doing this, we use the c() function to combine the column names we want.

DGE_Results[1, c("GeneID", "logFC")]

In general, we encourage using column names, rather than the numeric positions. This can be more robust if you accidentally change the column order, and is also far easier to understand when other people (e.g. you in the future when you’ve forgotten what you did in the past) read your code.

Another way to extract information from a single column is to use the $ symbol, and the square brackets. In this case, our column is treated as a vector and we only need to specify the position in the brackets, without a comma. In the following, we’ll just grab the first two gene identifiers.

DGE_Results$GeneID[1:2]

Introducing `dplyr`

The above methods are the traditional ways of accessing data from a tibble or data.frame. Another package written by Hadley Wickham is dplyr which introduces a variety of additional strategies for manipulating these objects. These not only make sense to an Excel user, but enable extremely powerful approaches for data manipulation. Let’s add the following to our loadPackages chunk:

library(dplyr)

Using `select()` to select columns

Whilst we have already seen how to select columns using conventional strategies, dplyr introduces a new function select() which enables you to select columns. The strategy here is to pass our tibble/data.frame as the first argument to the function, and then to just list any columns we wish returned. Column names can be quoted, or given that they are columns within the data frame we’re working with, they can also be unquoted. We also no longer need to use the c() function to combine our column names.

select(DGE_Results, GeneID, logFC)

This may seem pretty simple, but given the helper functions which dplyr brings with it, becomes extremely powerful. Common helper functions are: contains(), starts_with(), ends_with(), one_of() and everything().

An example of this may be just to extract the logFC and logCPM columns.

select(DGE_Results, starts_with("log"))

We can also use this to rearrange our columns.

select(DGE_Results, logFC, everything())

Notice that through the above calls to select(), we never overwrote our original object. select() simply takes a tibble/data.frame as input and returns the same.

Using `arrange()` to sort columns

A common strategy in Excel is to sort your data by the values in a particular column. The function that dplyr introduces is arrange() and it behaves very similarly to select() in that we pass it the object, then call the columns we wish to sort by. Similarly, the object will not be overwritten, but if a tibble is provided, a tibble will be returned.

By default, we have given you the object sorted by the PValue column. If we wish to sort by expression level, we would sort on the logCPM column.

arrange(DGE_Results, logCPM)

This will sort in increasing order by default, but to switch this to descending order, we can wrap the column name in the desc() function.

arrange(DGE_Results, desc(logCPM))

Now the most highly expressed genes are the first ones shown.

Using `filter()` to subset our results

Yet another common strategy in Excel is to use the auto-filter to restrict the values shown to those matching a given criteria. We can do this using the filter() function, which behaves similarly to the previous ones we’ve seen, but this time we need to perform a logical test on our column(s) of interest. Let’s just get the most highly expressed genes:

filter(DGE_Results, logCPM > 10)

We could also return those beyond a given logFC threshold, using the abs() function to filter based on absolute values (i.e. where the $\pm$ sign is ignored).

filter(DGE_Results, abs(logFC) > 5)

We can also combine filtering criteria for more complex strategies. Here we can ask for the significant genes (FDR < 0.01) which are down-regulated.

filter(DGE_Results, FDR < 0.01, logFC < 0)

Using `slice()` to obtain rows

Whilst filter() can be very useful for returning rows that match a given criteria, sometimes we do want to call rows by position. A classic example, may be if we just want to show the top 10 genes in a given list. To do this, we can use the function slice().

slice(DGE_Results, 1:5)

Importantly, we use filter() to filter rows based on a given criteria, but we use slice() to return rows based on their position in the original object.

Using `mutate()` to add columns

The final thing we might like to do with our tibble object is to add a column. A common strategy might be to add a column denoting whether we consider each gene to be differentially expressed by some individual, or combined criteria. To add columns using dplyr, we can use the function mutate().

In the following code, we’re adding a new column called Sig and adding a logical (or boolean) value based in the gene matching a combined criteria of an FDR-adjusted $p$-value < 0.01, and logFC beyond the range $\pm1$

mutate(DGE_Results, Sig = FDR < 0.01 & abs(logFC) > 1)

Again, we haven’t yet overwritten our original object. We’ve just passed it to the function mutate() which has returned another tibble, but this time with one additional column.

We can also create more than one column using this strategy, by combining our process with the function case_when(). Note that for better ‘readability’, we’re going to stagger our code over multiple lines here. When doing this indenting becomes very useful, and the shortcut Ctrl+I will do this for you.

mutate(
    DGE_Results,
    Sig = FDR < 0.01 & abs(logFC) > 1,
    Direction = case_when(
        Sig & logFC > 0 ~ "Up",
        Sig & logFC < 0 ~ "Down",
        !Sig ~ "Unchanged"
    )
)

Combining functions

In the above, we introduced our key functions from dplyr and have already started to perform a few relatively complex operations in a fairly intuitive manner. We also made a point of mentioning that every one of the above functions takes a tibble (or data.frame) as input and returns the same. This makes it very easy to chain together multiple operations and produce a series of simple steps to get our data in the form we need it to be. There is an additional function from the package magrittr which is imported by dplyr. For those of you familiar with the pipe symbol in bash, this is R’s equivalent operator. In short, the %>% function takes the output of one command, and ‘pipes’ it into the next command, strictly in the first position. By way of example.

filter(DGE_Results, abs(logFC) > 5)

Is identical to

DGE_Results %>% filter(abs(logFC) > 5)

When working like this, we often spread out our steps over multiple lines too, so the above can be expressed as:

DGE_Results %>% 
    filter(abs(logFC) > 5)

(Note the indenting)

Now we can chain together multiple operations by placing this after every function call. To get a list of our top 10 up-regulated genes we could do the following. First we’ll filter based on an FDR < 0.01, then we’ll restrict to the genes with logFC > 0. The we’re sorting by PValue, just to make sure the data is arranged how we think it is. Next, we’ll remove the column PValue as it doesn’t really add much useful information here. Note that we can do this by simply placing a - symbol before the column name. Finally, we’ll use slice to return the first 10 values

DGE_Results %>%
    filter(FDR < 0.01 & logFC > 0) %>%
    arrange(PValue) %>%
    select(-PValue) %>%
    slice(1:10)

We also introduced the package pander earlier, and this also has a default method for making tibble and data.frame objects look amazing in R markdown. In the following, we’re adding two arguments to the call to pander(). Firstly, we’re adding a caption, then we’re making sure our markdown table is left-aligned for the first column (text), the right-aligned for the numeric columns.

DGE_Results %>%
    filter(FDR < 0.01 & logFC > 0) %>%
    arrange(PValue) %>%
    select(-PValue) %>%
    slice(1:10) %>%
    pander(
        caption = "Top 10 up-regulated genes",
        justify = "lrrr"
    )

As you can see, this now enables a large variety of data manipulations to be performed, in a manner which is both intuitive to write, and relatively easy to read back. Much of the syntax for dplyr is based on SQL syntax, where a spreadsheet-type structure is referred to as a table. This is actually how the name tibble came about. They were originally called tbl_df objects to reflect that they mirror an SQL table but are also a data.frame. Over the years this became pronounced then formalised into tibble.

Repeat the above but for down-regulated genes

Incorporating Data Manipulation With Plots

Now we’ve learned how to add useful information to our DGE_Results object, this opens up a few useful ways to enhance our plots. A good strategy may be to indicate which genes are considered to be differentially-expressed on our MD plot. Our previous strategy of adding the column Sig would be a good approach here. Once we’ve done this, we can use this to plot our DE genes in a different colour. Again, we just need to specify colour as a plotting aesthetic.

DGE_Results %>%
    mutate(Sig = FDR < 0.01 & abs(logFC) > 1) %>%
    ggplot(aes(x = logCPM, y = logFC, colour = Sig)) +
    geom_point() +
    geom_hline(yintercept = c(-1, 1), colour = "blue", linetype = 2) +
    labs(x = "Average Expression (logCPM)") +
    theme_bw()

By default ggplot2 will assign a two colour category the two colours you see. This may not be what we’re after, so to change these colours, we use another ggplot2 function scale_colour_manual(), which we add as an additional layer. We can add this layer anywhere we choose, but a good strategy is to put any scale parameters between our geometry and theme layers.

DGE_Results %>%
    mutate(Sig = FDR < 0.01 & abs(logFC) > 1) %>%
    ggplot(aes(x = logCPM, y = logFC, colour = Sig)) +
    geom_point() +
    geom_hline(yintercept = c(-1, 1), colour = "blue", linetype = 2) +
    scale_colour_manual(values = c("grey40", "red")) +
    labs(x = "Average Expression (logCPM)") +
    theme_bw()

This isn’t a really informative legend, so if we wish to hide this we could use one of two approaches. We can turn this off using a guides() layer.

DGE_Results %>%
    mutate(Sig = FDR < 0.01 & abs(logFC) > 1) %>%
    ggplot(aes(x = logCPM, y = logFC, colour = Sig)) +
    geom_point() +
    geom_hline(yintercept = c(-1, 1), colour = "blue", linetype = 2) +
    scale_colour_manual(values = c("grey40", "red")) +
    guides(colour = FALSE) +
    labs(x = "Average Expression (logCPM)") +
    theme_bw()

Or we can turn this off by adding a new theme() layer. The previous method is useful if we only wish to hide specific aesthetics, while the theme() approach can be used to hide all legends.

DGE_Results %>%
    mutate(Sig = FDR < 0.01 & abs(logFC) > 1) %>%
    ggplot(aes(x = logCPM, y = logFC, colour = Sig)) +
    geom_point() +
    geom_hline(yintercept = c(-1, 1), colour = "blue", linetype = 2) +
    scale_colour_manual(values = c("grey40", "red")) +
    labs(x = "Average Expression (logCPM)") +
    theme_bw() +
    theme(legend.position = "none")

This latter approach can also be used to place the legend in alternative positions. Options can be legend.position = "bottom", or "right", "left" or "top".

We could take this idea even further for our volcano plot, where we could colour up and down regulated genes separately.

DGE_Results %>%
    mutate(
        Sig = FDR < 0.01 & abs(logFC) > 1,
        Direction = case_when(
            Sig & logFC > 0 ~ "Up",
            Sig & logFC < 0 ~ "Down",
            !Sig ~ "Unchanged"
        )
    ) %>%
    ggplot(aes(logFC, -log10(PValue), colour = Direction)) +
    geom_point() +
    scale_colour_manual(values = c("blue", "grey", "red")) +
    labs(y = "-log10(p)") +
    theme_bw()

The next step might be to add gene names, so let’s move to the next section.

Spring Into Bioinformatics

Working With Data In R

Steve Pederson

02/10/2019

Introduction

Tibbles and Data Frames

Introducing `dplyr`

Using `select()` to select columns

Using `arrange()` to sort columns

Using `filter()` to subset our results

Using `slice()` to obtain rows

Using `mutate()` to add columns

Combining functions

Incorporating Data Manipulation With Plots

Spring Into Bioinformatics

Working With Data In R

Steve Pederson

02/10/2019

Introduction

Tibbles and Data Frames

Introducing dplyr

Using select() to select columns

Using arrange() to sort columns

Using filter() to subset our results

Using slice() to obtain rows

Using mutate() to add columns

Combining functions

Incorporating Data Manipulation With Plots

Introducing `dplyr`

Using `select()` to select columns

Using `arrange()` to sort columns

Using `filter()` to subset our results

Using `slice()` to obtain rows

Using `mutate()` to add columns