15 April 2019
Also known as data munging
We'll cover:
SQL- and Excel-like functions in dplyrtidyrstringrlibrary(tidyverse)
dplyr works specifically with data.frame objectstibble objectslibrary(tidyverse) loads dplyr, stringr and tidyr (+ others)==!=|<<=x <- c(1:5, NA) x == 5 x != 5 x > 3 x > 3 | x == 2 is.na(x)
R also recognises the symbol & for ANDdplyrdplyrdplyrData for this session:
transport <- read_csv("data/transport_clean.csv")
transport
So we have a 32 x 5 data frame
dim(transport) nrow(transport) ncol(transport)
dplyr::select()dplyr is part of the core tidyverseselect() allows you to select columns by namedata_frame must be given as the first argumentselect(transport, gender, weight, height, method)
Or by position
select(transport, 1:3)
dplyr::select()-) signselect(transport, -2) select(transport, -name)
Discuss: Would removing by name or position by the best?
dplyr::select()The select() function has a few helper functions
starts_with(), ends_with(), contains(), one_of() and everything()select(transport, ends_with("ght"))
select(transport, contains("t"))
dplyr::select()So far, we haven't changed the original object
We can overwrite this anytime (sometimes we will by accident)
transport <- select(transport, -name) transport
Now we have removed the name column in our oject transport
dplyr::select().csv filetransport <- read_csv("data/transport_clean.csv")
(Ah, the joys of using code…)
dplyr::filter()data_frame as the first argumentfilter(transport, method == "car") filter(transport, method == "car", gender == "female")
dplyr::arrange()We can sort on one or more columns
arrange(transport, weight) arrange(transport, method, height)
We can change to descending order using desc()
arrange(transport, desc(weight))
dplyr steps up a gear%>%| in the bash shell (Covered tomorrow)magrittrdplyrPlaces the output of the first function as the first argument of the next function!
filter(transport, method == "bike")
Is identical to
transport %>% filter(method == "bike")
But now we can chain together multiple operations!
transport %>% filter(method == "bike") %>% arrange(weight)
There is no limit to the number of functions you can connect together
dplyr takes a data.frame as the first argumentdata.frametransport %>% filter(method == "bike") %>% arrange(weight)
magrittr placed transport into the first argument of filter()filter() was a data.frame \(\implies\) this was placed as the first argument to arrange()dplyr::mutate()We can add extra columns using mutate()
transport %>% mutate(height_m = height/100)
Once we've added a column, we can refer to it by name
transport %>%
mutate(height_m = height/100,
BMI = weight / height_m^2)
dplyr::mutate()We can also overwrite existing columns
transport %>% mutate(height = height/100)
Have we changed the original data_frame yet?
dplyr::rename()Can use the function rename()
transport %>% rename(height_cm = height)
Now we can get crazy
transport %>%
rename(height_cm = height) %>%
mutate(height_m = height_cm/100,
BMI = weight / height_m^2) %>%
filter(BMI > 25)
We can get summaries for entire columns
transport %>% summarise(mean(weight), mean(height))
Let's deal with that missing value:
transport %>%
summarise(
mn_wt = mean(weight, na.rm = TRUE),
mn_ht = mean(height, na.rm = TRUE)
)
We can group categorical variables by their levels
transport %>%
group_by(gender) %>%
summarise(
mn_wt = mean(weight, na.rm = TRUE),
mn_ht = mean(height, na.rm = TRUE)
)
Or combinations of levels
transport %>%
group_by(gender, method) %>%
summarise(
mn_wt = mean(weight, na.rm = TRUE),
mn_ht = mean(height, na.rm = TRUE)
)
We can use any function that spits out a single value
sd(), min(), median()We could assign our own names to the columns
transport %>%
group_by(gender, method) %>%
summarise(
mn_wt = mean(weight, na.rm = TRUE),
mn_ht = mean(height, na.rm = TRUE)
)
And we can count how many entries we have using n()
transport %>%
group_by(gender, method) %>%
summarise(
mn_wt = mean(weight, na.rm = TRUE),
mn_ht = mean(height, na.rm = TRUE),
n = n())
summarise_at()starts_with(), ends_with(), contains() syntax to specify variables/columns
vars()list using list(~function)transport %>%
group_by(gender, method) %>%
summarise_at(
vars(ends_with("ght")),
list(~mean, ~sd),
na.rm = TRUE)
We could reorder the results using select()
transport %>%
group_by(gender, method) %>%
summarise_at(
vars(ends_with("ght")),
list(~mean, ~sd),
na.rm = TRUE) %>%
select(gender, method,
starts_with("weight"),
starts_with("height"))
dplyr::case_when()Kalen just had her height measured and she is 158.1cm. That missing value can be fixed!!!
transport %>%
mutate(height = case_when(
name == "Kalen" ~ 158.1,
name != "Kalen" ~ height
))
case_when() is like if_else() but allows for multiple options
dplyr::case_when()Let's overwrite that initial object
transport <- transport %>%
mutate(height = case_when(
name == "Kalen" ~ 158.1,
name != "Kalen" ~ height
))
What would happen if we accidentally ran this again?
dplyr has some very useful functions for combining data.frame objects: bind_rows(), left_join(), right_join() and full_join()
Unbelievably, we've just had even more data.
Tony is a 83kg car-driver, who is 177.3cm tall!
We can manually create a tibble for him
tony <- tibble(
gender = "male",
name = "Tony",
weight = 83,
height = 177.3,
method = "car"
)
We can add his data using bind_rows()
bind_rows(tony, transport)
contacts.csv to your data foldercontacts.csvcontacts <- read_csv("data/contacts.csv")
dim(contacts)
dim(transport)
Do these objects look compatible?
name column look similarHow do we combine these?
left_join()We can use the first data_frame as the template
left_join(transport, contacts, by = "name")
name entries missing from the second object (contacts) will become a missing value in the column contactname entries in the first which have multiple matches in the second will be expandedright_join()We can use the second object as the template
right_join(transport, contacts, by = "name")
Any missing from the first object (data) will become a row of missing values where missing
full_join()Or we can combine everything
full_join(transport, contacts, by = "name")
Will be missing values in both directions
wide formIn long form, the information is structured around the measurement
wideData <- read_csv("data/wide.csv")
wideData
This is a time course:
Tx) A & BThe basic function is gather()
tidyrkey: what are the column names describingvalue: what is the value in these columnsgather using dplyr helper functionswideData %>%
gather(key = "Day",
value = "Change",
contains("day"))
Many functions require data to be in this format
How could we get means for each treatment/day from the original data?
How could we get means for each treatment/day from the original data?
wideData %>%
gather(key = "Day",
value = "Change",
contains("day")) %>%
group_by(Tx, Day) %>%
summarise(mn_change = mean(Change))
Might be easier if needing more summaries (e.g. sd())
Let's save that last summary data.frame
wideMeans <- wideData %>%
gather(key = "Day",
value = "Change",
contains("day")) %>%
group_by(Tx, Day) %>%
summarise(mn_change = mean(Change))
We can change from long to wide using the 'opposite' function spread()
key = column become the new column namesvalue = column remain the valueswideMeans %>%
spread(key = Day,
value = mn_change)
Would we ever use both gather() and spread()?
pcr <- read_csv("data/PCR.csv")
head(pcr)
Here we have 3 genes being measured in 2 cell types, across 3 time-points
The first part is easy:
pcr %>%
gather(key = CellType,
value = Ct,
ends_with("hr"))
If we want to now organise by treatment and timepoint:
CellType into two columnsWe can use the function separate()
pcr %>%
gather(key = CellType,
value = Ct,
ends_with("hr")) %>%
separate(col = CellType,
into = c("CellType", "TimePoint"))
Now we can form our final data structure
pcr %>%
gather(key = CellType,
value = Ct,
ends_with("hr")) %>%
separate(col = CellType,
into = c("CellType", "TimePoint")) %>%
spread(key = TimePoint,
value = Ct)