1.2 Tidying Data

15 April 2019

Data Manipulation

Also known as data munging

We'll cover:

SQL- and Excel-like functions in dplyr
Changing from wide to long form using tidyr
Editing text using stringr

Data Manipulation

Loading The Packages

library(tidyverse)

The package dplyr works specifically with data.frame objects
Works optimally with tibble objects
library(tidyverse) loads dplyr, stringr and tidyr (+ others)

But First

Logical Tests

Is Equal To: ==
Not equal: !=
Or: |
Less than: <
Less than or equal: <=

Logical Tests

x <- c(1:5, NA)
x == 5
x != 5
x > 3
x > 3 | x == 2
is.na(x)

R also recognises the symbol & for AND
Not highly relevant for dplyr

The package `dplyr`

Starting with `dplyr`

Data for this session:

transport <- read_csv("data/transport_clean.csv")
transport

So we have a 32 x 5 data frame

dim(transport)
nrow(transport)
ncol(transport)

Starting with `dplyr::select()`

dplyr is part of the core tidyverse
The function select() allows you to select columns by name
The data_frame must be given as the first argument

select(transport, gender, weight, height, method)

Or by position

select(transport, 1:3)

Starting with `dplyr::select()`

We can also remove columns using the minus (-) sign

select(transport, -2)
select(transport, -name)

Discuss: Would removing by name or position by the best?

Starting with `dplyr::select()`

The select() function has a few helper functions

starts_with(), ends_with(), contains(), one_of() and everything()

select(transport, ends_with("ght"))
select(transport, contains("t"))

Starting with `dplyr::select()`

So far, we haven't changed the original object

We can overwrite this anytime (sometimes we will by accident)

transport <- select(transport, -name)
transport

Now we have removed the name column in our oject transport

Starting with `dplyr::select()`

To get the column back, we need to reload the .csv file

transport <- read_csv("data/transport_clean.csv")

(Ah, the joys of using code…)

Using `dplyr::filter()`

We can use our logical tests to filter the data
Again we need to specify the data_frame as the first argument

filter(transport, method == "car")
filter(transport, method == "car", gender == "female")

Using `dplyr::arrange()`

We can sort on one or more columns

arrange(transport, weight)
arrange(transport, method, height)

We can change to descending order using desc()

arrange(transport, desc(weight))

Combining Functions

This is where dplyr steps up a gear
We can chain functions together using %>%
This behaves like a | in the bash shell (Covered tomorrow)
From the package magrittr
Is always imported by dplyr

Places the output of the first function as the first argument of the next function!

Combining Functions

filter(transport, method == "bike")

Is identical to

transport %>% filter(method == "bike")

Combining Functions

But now we can chain together multiple operations!

transport %>% filter(method == "bike") %>% arrange(weight)

There is no limit to the number of functions you can connect together

For the technically minded

Each function in dplyr takes a data.frame as the first argument
Each function returns a data.frame

transport %>% filter(method == "bike") %>% arrange(weight)

The magrittr placed transport into the first argument of filter()
The output of filter() was a data.frame \(\implies\) this was placed as the first argument to arrange()

Using `dplyr::mutate()`

We can add extra columns using mutate()

transport %>% mutate(height_m = height/100)

Once we've added a column, we can refer to it by name

transport %>% 
  mutate(height_m = height/100, 
         BMI = weight / height_m^2)

Using `dplyr::mutate()`

We can also overwrite existing columns

transport %>% mutate(height = height/100)

Have we changed the original data_frame yet?

Using `dplyr::rename()`

Can use the function rename()

transport %>% rename(height_cm = height)

Now we can get crazy

transport %>%
  rename(height_cm = height) %>%
  mutate(height_m = height_cm/100,
         BMI = weight / height_m^2) %>%
  filter(BMI > 25)

Getting Summaries

We can get summaries for entire columns

transport %>% 
  summarise(mean(weight), mean(height))

Getting Summaries

Let's deal with that missing value:

transport %>%
  summarise(
      mn_wt = mean(weight, na.rm = TRUE), 
      mn_ht = mean(height, na.rm = TRUE)
  )

Getting Summaries

We can group categorical variables by their levels

transport %>%
  group_by(gender) %>%
  summarise(
      mn_wt = mean(weight, na.rm = TRUE), 
      mn_ht = mean(height, na.rm = TRUE)
  )

Getting Group Summaries

Or combinations of levels

transport %>%
  group_by(gender, method) %>%
  summarise(
      mn_wt = mean(weight, na.rm = TRUE), 
      mn_ht = mean(height, na.rm = TRUE)
  )

We can use any function that spits out a single value

sd(), min(), median()

Getting Group Summaries

We could assign our own names to the columns

transport %>%
  group_by(gender, method) %>%
  summarise(
      mn_wt = mean(weight, na.rm = TRUE), 
      mn_ht = mean(height, na.rm = TRUE)
  )

Getting Group Summaries

And we can count how many entries we have using n()

transport %>%
  group_by(gender, method) %>%
  summarise(
      mn_wt = mean(weight, na.rm = TRUE), 
      mn_ht = mean(height, na.rm = TRUE),
      n = n())

Applying Multiple Functions

Use the function summarise_at()
Can apply starts_with(), ends_with(), contains() syntax to specify variables/columns
- Must be wrapped in vars()
Can apply multiple functions
- Must be given as a list using list(~function)

Applying Multiple Functions

transport %>%
  group_by(gender, method) %>%
  summarise_at(
      vars(ends_with("ght")), 
      list(~mean, ~sd), 
      na.rm = TRUE)

Applying Multiple Functions

We could reorder the results using select()

transport %>%
  group_by(gender, method) %>%
  summarise_at(
      vars(ends_with("ght")), 
      list(~mean, ~sd), 
      na.rm = TRUE) %>%
  select(gender, method, 
         starts_with("weight"), 
         starts_with("height"))

Using `dplyr::case_when()`

Kalen just had her height measured and she is 158.1cm. That missing value can be fixed!!!

transport %>%
  mutate(height = case_when(
    name == "Kalen" ~ 158.1,
    name != "Kalen" ~ height
  ))

case_when() is like if_else() but allows for multiple options

Using `dplyr::case_when()`

Let's overwrite that initial object

transport <- transport %>%
  mutate(height = case_when(
    name == "Kalen" ~ 158.1,
    name != "Kalen" ~ height
  ))

What would happen if we accidentally ran this again?

Combining Data Frames

dplyr has some very useful functions for combining data.frame objects: bind_rows(), left_join(), right_join() and full_join()

Unbelievably, we've just had even more data.

Tony is a 83kg car-driver, who is 177.3cm tall!

We can manually create a tibble for him

Combining Data Frames

tony <- tibble(
    gender = "male",
    name = "Tony",
    weight = 83,
    height = 177.3,
    method = "car"
)

We can add his data using bind_rows()

bind_rows(tony, transport)

Combining Data Frames

Download the object contacts.csv to your data folder
Load the contact numbers in the file contacts.csv

contacts <- read_csv("data/contacts.csv")
dim(contacts)
dim(transport)

Combining Data Frames

Do these objects look compatible?

Some may be missing in one of the objects
The values in the name column look similar
In a different order

How do we combine these?

Combining Data Frames

`left_join()`

We can use the first data_frame as the template

left_join(transport, contacts, by = "name")

Any name entries missing from the second object (contacts) will become a missing value in the column contact
Any name entries in the first which have multiple matches in the second will be expanded

Combining Data Frames

`right_join()`

We can use the second object as the template

right_join(transport, contacts, by = "name")

Any missing from the first object (data) will become a row of missing values where missing

Combining Data Frames

`full_join()`

Or we can combine everything

full_join(transport, contacts, by = "name")

Will be missing values in both directions

Reshaping your data

This dataset is in what we refer to as wide form
We have a row of measurements for each individual
The information is structured around the individual

In long form, the information is structured around the measurement

Reshaping your data

From Wide to Long

wideData <- read_csv("data/wide.csv")
wideData

This is a time course:

Measuring log-change in GFP intensity
Two treatments (Tx) A & B
Also have identifier columns

From Wide to Long

The basic function is gather()

Part of the core tidyverse package tidyr
Key arguments are:
- key: what are the column names describing
- value: what is the value in these columns
Select columns to gather using dplyr helper functions

From Wide to Long

wideData %>%
  gather(key = "Day",
         value = "Change",
         contains("day"))

Many functions require data to be in this format

From Wide to Long

How could we get means for each treatment/day from the original data?

From Wide to Long

How could we get means for each treatment/day from the original data?

wideData %>%
  gather(key = "Day",
         value = "Change",
         contains("day")) %>%
  group_by(Tx, Day) %>%
  summarise(mn_change = mean(Change))

Might be easier if needing more summaries (e.g. sd())

Reshaping your data

From Long To Wide

Let's save that last summary data.frame

wideMeans <- wideData %>%
  gather(key = "Day",
         value = "Change",
         contains("day")) %>%
  group_by(Tx, Day) %>%
  summarise(mn_change = mean(Change))

From Long To Wide

We can change from long to wide using the 'opposite' function spread()

Values in the key = column become the new column names
Values in the value = column remain the values

wideMeans %>%
  spread(key = Day,
         value = mn_change)

From Long To Wide

Would we ever use both gather() and spread()?

pcr <- read_csv("data/PCR.csv")
head(pcr)

Here we have 3 genes being measured in 2 cell types, across 3 time-points

From Long To Wide

The first part is easy:

pcr %>%
  gather(key = CellType,
         value = Ct,
         ends_with("hr"))

If we want to now organise by treatment and timepoint:

Need to separate CellType into two columns

From Long To Wide

We can use the function separate()

pcr %>%
  gather(key = CellType,
         value = Ct,
         ends_with("hr")) %>%
  separate(col = CellType, 
           into = c("CellType", "TimePoint"))

From Long To Wide

Now we can form our final data structure

pcr %>%
  gather(key = CellType,
         value = Ct,
         ends_with("hr")) %>%
  separate(col = CellType, 
           into = c("CellType", "TimePoint")) %>%
  spread(key = TimePoint,
         value = Ct)

Data Manipulation

Data Manipulation

Data Manipulation

Loading The Packages

But First

Logical Tests

Logical Tests

The package dplyr

Starting with dplyr

Starting with dplyr::select()

Starting with dplyr::select()

Starting with dplyr::select()

Starting with dplyr::select()

Starting with dplyr::select()

Using dplyr::filter()

Using dplyr::arrange()

Combining Functions

Combining Functions

Combining Functions

Combining Functions

For the technically minded

Using dplyr::mutate()

Using dplyr::mutate()

Using dplyr::rename()

Getting Summaries

Getting Summaries

Getting Summaries

Getting Group Summaries

Getting Group Summaries

Getting Group Summaries

Applying Multiple Functions

Applying Multiple Functions

Applying Multiple Functions

Applying Multiple Functions

Using dplyr::case_when()

Using dplyr::case_when()

Combining Data Frames

Combining Data Frames

Combining Data Frames

Combining Data Frames

Combining Data Frames

Combining Data Frames

left_join()

Combining Data Frames

right_join()

Combining Data Frames

full_join()

Reshaping your data

Reshaping your data

Reshaping your data

From Wide to Long

From Wide to Long

From Wide to Long

From Wide to Long

From Wide to Long

Reshaping your data

From Long To Wide

From Long To Wide

From Long To Wide

From Long To Wide

From Long To Wide

From Long To Wide

The package `dplyr`

Starting with `dplyr`

Starting with `dplyr::select()`

Starting with `dplyr::select()`

Starting with `dplyr::select()`

Starting with `dplyr::select()`

Starting with `dplyr::select()`

Using `dplyr::filter()`

Using `dplyr::arrange()`

Using `dplyr::mutate()`

Using `dplyr::mutate()`

Using `dplyr::rename()`

Using `dplyr::case_when()`

Using `dplyr::case_when()`

`left_join()`

`right_join()`

`full_join()`