20 July 2016

Data Manipulation

Data Manipulation

Also known as data munging

We'll cover:

  • SQL- and Excel-like functions in dplyr
  • Changing from wide to long form using reshape2
  • Editing text using stringr

Data Manipulation

Loading The Packages

library(readr)
library(dplyr)
library(tibble)
  • The package dplyr works specifically with data.frame objects
  • Works optimally with local data frame aka tbl_df objects
  • Have just been renamed tibble objects

But First

Logical Tests

  • Is Equal To: ==
  • Not equal - !=
  • OR - |
  • Less than <
  • Less than or equal <=

Logical Tests

x <- 1:10
x == 5
x !=5
x > 5
x > 5 | x == 2
  • R also recognises the symbol & for AND
  • Not relevant for dplyr

Starting with dplyr

Data for this session:

data <- read_csv("data/comments.csv", comment = "#")
data

So we have a 32 x 6 data frame

dim(data)
nrow(data)
ncol(data)

Starting with dplyr

The select() function

The first row is the row names from the original file.

How can we remove this column?

The function select() allows you to select columns by name

select(data, gender, name, weight, height, transport)

Starting with dplyr

The select() function

The first row is the row names from the original file.

How can we remove this column?

The function select() allows you to select columns by name

select(data, gender, name, weight, height, transport)

Or by position

select(data, 2:6)

Starting with dplyr

The select() function

  • We can also remove columns using the minus (-) sign
select(data, -1)
  • As this column has no name, we can't call/remove it by name.
  • We can remove other columns by name
select(data, -transport)

Discuss: Would removing by name or position by the best?

The select() function

The select() function has a few bonus functions:

  • starts_with(), ends_with(), contains(), one_of() and everything()
select(data, ends_with("t"))
select(data, contains("eig"))

The select() function

So far, we haven't changed the original object

We can overwrite this anytime (sometimes by accident)

data <- select(data, -1)
data

Now we have removed the meaningless columns

  • To get the column, we just need to reload the .csv file

Using filter() and arrange()

We can use our logical tests to filter the data

filter(data, transport == "car")
filter(data, transport == "car", gender == "female")

We can sort on one or more columns

arrange(data, weight)
arrange(data, desc(weight))
arrange(data, transport, height)

Combining Functions

  • This is where dplyr steps up a gear
  • We can chain functions together using %>%
  • This behaves like a | in the bash shell
  • Known as the magrittr

Combining Functions

data %>% filter(transport == "bike")
data %>% filter(transport == "bike") %>% arrange(weight)

There is no limit to the number of functions you can chain together

Combining Functions

For the technically minded

Each function in dplyr takes a data.frame as the first argument

The magrittr pipes our (modified) data.frame into the next function as the first argument

Adding extra columns

We can add extra columns using mutate()

data %>% mutate(height_m = height/100)

Once we've added a column, we can refer to it by name

data %>% mutate(height_m = height/100, BMI = weight / height_m^2)

We can also overwrite existing columns

data %>% mutate(height = height/100)

Have we changed the original data.frame?

Changing Column Names

Can use the function rename()

data %>% rename(height_cm = height)

Now we can get crazy

data %>%
  rename(height_cm = height) %>%
  mutate(height_m = height_cm/100,
         BMI = weight / height_m^2) %>%
  filter(BMI > 25)

Getting Summaries

Again, this is where dplyr really makes it easy.

data %>% summarise(mean(weight), mean(height))
data %>% summarise_each(funs(mean, sd), ends_with("ght"))

Getting Group Summaries

We can group categorical variables by their levels

data %>%
  group_by(gender) %>%
  summarise_each(funs(mean), ends_with("ght"))

Getting Group Summaries

Or combinations of levels

data %>%
  group_by(gender, transport) %>%
  summarise_each(funs(mean), ends_with("ght"))

We can use any function that spits out a single value

  • sd(), min(), median()
  • Plus the function n()

Getting Group Summaries

data %>%
  group_by(gender, transport) %>%
  summarise(mn_weight = mean(weight),
            mn_height = mean(height),
            n = n())

Reshaping your data

Reshaping your data

This dataset is in what we refer to as wide form

  • We have a row of measurements for each individual
  • The information is structured around the individual
  • In long form, the information is structured around the measurement

Reshaping your data

From Wide to Long

library(reshape2)
library(readr)
wideData <- read_csv("data/wide.csv")

This is a time course for two treatments

melt(wideData, id.vars = c("Name", "Tx"))

Many functions require data to be in this format

Reshaping your data

From Wide to Long

We don't need to leave those names as variable and value

wideData %>%
  melt(id.vars = c("Name", "Tx"),
       variable.name = "Day", 
       value.name = "Change") 

Reshaping your data

  1. How could we get means for each treatment/day from the original data?
  2. How can we get the same from the data after "melting"?

Reshaping your data

1 How could we get means for each treatment/day from the original data?

wideData %>% 
  group_by(Tx) %>%
  summarise_each(funs(mean), starts_with("day"))

2 How can we get the same from the data after "melting"?

wideData %>%
  melt(id.vars = c("Name", "Tx"),
       variable.name = "Day", 
       value.name = "Change") %>%
  group_by(Tx, Day) %>%
  summarise(mn_change = mean(Change))

Reshaping your data

From Long To Wide

Let's save that last summary data.frame

wideMeans <- wideData %>%
  melt(id.vars = c("Name", "Tx"),
       variable.name = "Day", 
       value.name = "Change") %>%
  group_by(Tx, Day) %>%
  summarise(mn_change = mean(Change))

Reshaping your data

From Long To Wide

We can change from long to wide using the formula syntax

  • "~" stands for is a function of, or depends on
  • The function dcast uses it to define rows on the LHS and columns on the RHS
dcast(wideMeans, Tx~Day)
dcast(wideMeans, Day~Tx)

Reshaping your data

From Long To Wide

Would we ever use both?

pcr <- read_csv("data/PCR.csv")

Here we have 3 genes being measured in 2 cell types, across 3 time-points

The first part is easy:

melt(pcr, id.vars = "Gene", variable.name = "CellType", value.name = "Ct")

Text Manipulation

We might like to split those names into the two cell types and 3 time-points.

  • Use the functions str_extract() and str_replace()
library(stringr)
pcr %>%
  melt(id.vars = "Gene",
       variable.name = "CellType", value.name = "Ct") %>%
  mutate(TimePoint = str_extract(CellType, "(0hr|12hr|24hr)"),
         CellType = str_replace(CellType, "_(0hr|12hr|24hr)", "")) 

Save this as the object pcr_long

Text Manipulation

Now we could make it wide (but not as wide)

pcr %>%
  melt(id.vars = "Gene", 
       variable.name = "CellType", value.name = "Ct") %>%
  mutate(TimePoint = str_extract(CellType, "(0hr|12hr|24hr)"),
         CellType = str_replace(CellType, "_(0hr|12hr|24hr)", "")) %>%
  dcast(Gene + CellType ~ TimePoint, value.var = "Ct")

Challenge

To investigate the effect of fitness, gender, country of origin on red blood cell count (RBC), you have been given challenge.csv with observations made on 500 subjects. The variables are:

  1. gender: male or female.
  2. fitness: a measure of fitness lying between 0 and 100.
  3. country: there are three countries denoted by 1, 2, or 3.
  4. RBC: red blood cell count in millions/\(\mu\)L.

Clean the data entirely in R