27 November 2016

Data Manipulation

Data Manipulation

Also known as data munging

We'll cover:

  • Working with data.frame objects (or tibbles)
  • SQL- and Excel-like functions in dplyr
  • Changing from wide to long form using reshape2

Data Manipulation

Why?

  • Often dealing with summary type information
    • Lists of differentially expressed genes
    • Summaries of read counts across fastq libraries
  • I spend more time doing this than actual "genomics"

Data Manipulation

Loading The Packages

  • Create a new script file called Using_dplyr.R
  • Add the following as the first three lines
  • A common (and good) practice is to start a script by loading the packages
library(readr)
library(dplyr)
library(tibble)

Data Manipulation

Loading The Packages

  • The package dplyr has functions to work specifically with data.frame objects
  • Works optimally with local data frame aka tbl_df objects
  • Have just been renamed tibble objects

But First

Logical Tests

  • Is Equal To: ==
  • Not equal - !=
  • OR - |
  • Less than <
  • Less than or equal <=

Logical Tests

x <- 1:10
x == 5
x != 5
x > 5
x > 5 | x == 2
  • R also recognises the symbol & for AND
  • Not very relevant for dplyr

Logical Tests

  • We could also find the positions of the values which are TRUE in the previous tests
which(x >5)

Starting with dplyr

Data for this session:

data <- read_csv("data/comments.csv", comment = "#")
data

So we have a 32 x 6 data frame

dim(data)
nrow(data)
ncol(data)

Starting with dplyr

The select() function

The first row is the row names from the original file.

How can we remove this column?

(We could use data[,2:5])

Starting with dplyr

The select() function

The first row is the row names from the original file.

How can we remove this column?

The function select() allows you to select columns by name

select(data, gender, name, weight, height, transport)

Starting with dplyr

The select() function

The first row is the row names from the original file.

How can we remove this column?

The function select() allows you to select columns by name (good)

select(data, gender, name, weight, height, transport)

Starting with dplyr

The select() function

Or by position (not so good)

select(data, 2:6)

Discuss: What are the merits of calling by name or position?

Starting with dplyr

The select() function

  • We can also remove columns using the minus (-) sign
select(data, -X1)
  • We can remove any other columns by name
select(data, -transport)

The select() function

The select() function has a few bonus functions:

  • starts_with(), ends_with(), contains(), one_of() and everything()
select(data, ends_with("t"))
select(data, contains("eig"))

The select() function

So far, we haven't changed the original object

We can overwrite this anytime (sometimes by accident)

data <- select(data, -X1)
data
  • Now we have removed the meaningless columns

  • To get the column back, we just need to reload the .csv file using our command we've saved in the script

Using filter() and arrange()

We can use our logical tests to filter the data

filter(data, transport == "car")
filter(data, transport == "car", gender == "female")

We can sort on one or more columns

arrange(data, weight)
arrange(data, desc(weight))
arrange(data, transport, height)

Combining Functions

  • This is where dplyr steps up a gear
  • We can chain functions together using %>%
  • This behaves like a | in the bash shell
  • Known as the magrittr

Combining Functions

data %>% filter(transport == "bike")

And then sort based on weight

data %>% filter(transport == "bike") %>% arrange(weight)

There is no limit to the number of functions you can chain together

Combining Functions

For the technically minded

Each function in dplyr takes a data.frame as the first argument

The magrittr pipes our (modified) data.frame into the next function as the first argument

?select

Adding extra columns

We can add extra columns using mutate()

data %>% mutate(height_m = height/100)

Once we've added a column, we can refer to it by name

data %>% mutate(height_m = height/100, BMI = weight / height_m^2)

We can also overwrite existing columns

data %>% mutate(height = height/100)

Have we changed the original data.frame?

Changing Column Names

Can use the function rename()

data %>% rename(height_cm = height)

Now we can get crazy

data %>%
  rename(height_cm = height) %>%
  mutate(height_m = height_cm/100,
         BMI = weight / height_m^2) %>%
  filter(BMI > 25)

Getting Summaries

Again, this is where dplyr really makes it easy.

data %>% summarise(mean(weight), mean(height))
data %>% summarise_each(funs(mean, sd), ends_with("ght"))

Getting Group Summaries

We can group categorical variables by their levels

data %>%
  group_by(gender) %>%
  summarise_each(funs(mean), ends_with("ght"))

Getting Group Summaries

Or combinations of levels

data %>%
  group_by(gender, transport) %>%
  summarise_each(funs(mean), ends_with("ght"))

We can use any function that spits out a single value

  • sd(), min(), median()
  • Plus the function n()

Getting Group Summaries

data %>%
  group_by(gender, transport) %>%
  summarise(mn_weight = mean(weight),
            mn_height = mean(height),
            n = n())

Reshaping your data

Reshaping your data

This dataset is in what we refer to as wide form

  • We have a row of measurements for each individual
  • The information is structured around the individual
  • In long form, the information is structured around the measurement

Reshaping your data

From Wide to Long

library(reshape2)
wideData <- read_csv("data/wide.csv")

This is a time course for two treatments

melt(wideData, id.vars = c("Name", "Tx"))

Many functions require data to be in this format

Reshaping your data

From Wide to Long

We don't need to leave those names as variable and value

wideData %>%
  melt(id.vars = c("Name", "Tx"),
       variable.name = "Day", 
       value.name = "Change") 

Reshaping your data

  1. How could we get means for each treatment/day from the original data?
  2. How can we get the same from the data after "melting"?

Reshaping your data

1 How could we get means for each treatment/day from the original data?

wideData %>% 
  group_by(Tx) %>%
  summarise_each(funs(mean), starts_with("day"))

2 How can we get the same from the data after "melting"?

wideData %>%
  melt(id.vars = c("Name", "Tx"),
       variable.name = "Day", 
       value.name = "Change") %>%
  group_by(Tx, Day) %>%
  summarise(mn_change = mean(Change))

Reshaping your data

From Long To Wide

Let's save that last summary data.frame

wideMeans <- wideData %>%
  melt(id.vars = c("Name", "Tx"),
       variable.name = "Day", 
       value.name = "Change") %>%
  group_by(Tx, Day) %>%
  summarise(mn_change = mean(Change))

Reshaping your data

From Long To Wide

We can change from long to wide using the formula syntax

  • "~" stands for is a function of, or depends on
  • The function dcast uses it to define rows on the LHS and columns on the RHS
dcast(wideMeans, Tx~Day)
dcast(wideMeans, Day~Tx)

Reshaping your data

From Long To Wide

Would we ever use both?

pcr <- read_csv("data/PCR.csv")

Here we have 3 genes being measured in 2 cell types, across 3 time-points

The first part is easy:

melt(pcr, id.vars = "Gene", variable.name = "CellType", value.name = "Ct")