15 April 2019
Also known as data munging
We'll cover:
SQL-
and Excel-
like functions in dplyr
tidyr
stringr
library(tidyverse)
dplyr
works specifically with data.frame
objectstibble
objectslibrary(tidyverse)
loads dplyr
, stringr
and tidyr
(+ others)==
!=
|
<
<=
x <- c(1:5, NA) x == 5 x != 5 x > 3 x > 3 | x == 2 is.na(x)
R
also recognises the symbol &
for AND
dplyr
dplyr
dplyr
Data for this session:
transport <- read_csv("data/transport_clean.csv") transport
So we have a 32 x 5 data frame
dim(transport) nrow(transport) ncol(transport)
dplyr::select()
dplyr
is part of the core tidyverse
select()
allows you to select columns by namedata_frame
must be given as the first argumentselect(transport, gender, weight, height, method)
Or by position
select(transport, 1:3)
dplyr::select()
-
) signselect(transport, -2) select(transport, -name)
Discuss: Would removing by name or position by the best?
dplyr::select()
The select()
function has a few helper functions
starts_with()
, ends_with()
, contains()
, one_of()
and everything()
select(transport, ends_with("ght")) select(transport, contains("t"))
dplyr::select()
So far, we haven't changed the original object
We can overwrite this anytime (sometimes we will by accident)
transport <- select(transport, -name) transport
Now we have removed the name
column in our oject transport
dplyr::select()
.csv
filetransport <- read_csv("data/transport_clean.csv")
(Ah, the joys of using code…)
dplyr::filter()
data_frame
as the first argumentfilter(transport, method == "car") filter(transport, method == "car", gender == "female")
dplyr::arrange()
We can sort on one or more columns
arrange(transport, weight) arrange(transport, method, height)
We can change to descending order using desc()
arrange(transport, desc(weight))
dplyr
steps up a gear%>%
|
in the bash shell (Covered tomorrow)magrittr
dplyr
Places the output of the first function as the first argument of the next function!
filter(transport, method == "bike")
Is identical to
transport %>% filter(method == "bike")
But now we can chain together multiple operations!
transport %>% filter(method == "bike") %>% arrange(weight)
There is no limit to the number of functions you can connect together
dplyr
takes a data.frame
as the first argumentdata.frame
transport %>% filter(method == "bike") %>% arrange(weight)
magrittr
placed transport
into the first argument of filter()
filter()
was a data.frame
\(\implies\) this was placed as the first argument to arrange()
dplyr::mutate()
We can add extra columns using mutate()
transport %>% mutate(height_m = height/100)
Once we've added a column, we can refer to it by name
transport %>% mutate(height_m = height/100, BMI = weight / height_m^2)
dplyr::mutate()
We can also overwrite existing columns
transport %>% mutate(height = height/100)
Have we changed the original data_frame
yet?
dplyr::rename()
Can use the function rename()
transport %>% rename(height_cm = height)
Now we can get crazy
transport %>% rename(height_cm = height) %>% mutate(height_m = height_cm/100, BMI = weight / height_m^2) %>% filter(BMI > 25)
We can get summaries for entire columns
transport %>% summarise(mean(weight), mean(height))
Let's deal with that missing value:
transport %>% summarise( mn_wt = mean(weight, na.rm = TRUE), mn_ht = mean(height, na.rm = TRUE) )
We can group categorical variables by their levels
transport %>% group_by(gender) %>% summarise( mn_wt = mean(weight, na.rm = TRUE), mn_ht = mean(height, na.rm = TRUE) )
Or combinations of levels
transport %>% group_by(gender, method) %>% summarise( mn_wt = mean(weight, na.rm = TRUE), mn_ht = mean(height, na.rm = TRUE) )
We can use any function that spits out a single value
sd()
, min()
, median()
We could assign our own names to the columns
transport %>% group_by(gender, method) %>% summarise( mn_wt = mean(weight, na.rm = TRUE), mn_ht = mean(height, na.rm = TRUE) )
And we can count how many entries we have using n()
transport %>% group_by(gender, method) %>% summarise( mn_wt = mean(weight, na.rm = TRUE), mn_ht = mean(height, na.rm = TRUE), n = n())
summarise_at()
starts_with()
, ends_with()
, contains()
syntax to specify variables/columns
vars()
list
using list(~function)
transport %>% group_by(gender, method) %>% summarise_at( vars(ends_with("ght")), list(~mean, ~sd), na.rm = TRUE)
We could reorder the results using select()
transport %>% group_by(gender, method) %>% summarise_at( vars(ends_with("ght")), list(~mean, ~sd), na.rm = TRUE) %>% select(gender, method, starts_with("weight"), starts_with("height"))
dplyr::case_when()
Kalen just had her height measured and she is 158.1cm. That missing value can be fixed!!!
transport %>% mutate(height = case_when( name == "Kalen" ~ 158.1, name != "Kalen" ~ height ))
case_when()
is like if_else()
but allows for multiple options
dplyr::case_when()
Let's overwrite that initial object
transport <- transport %>% mutate(height = case_when( name == "Kalen" ~ 158.1, name != "Kalen" ~ height ))
What would happen if we accidentally ran this again?
dplyr
has some very useful functions for combining data.frame
objects: bind_rows()
, left_join()
, right_join()
and full_join()
Unbelievably, we've just had even more data.
Tony
is a 83kg car-driver, who is 177.3cm tall!
We can manually create a tibble
for him
tony <- tibble( gender = "male", name = "Tony", weight = 83, height = 177.3, method = "car" )
We can add his data using bind_rows()
bind_rows(tony, transport)
contacts.csv
to your data foldercontacts.csv
contacts <- read_csv("data/contacts.csv") dim(contacts) dim(transport)
Do these objects look compatible?
name
column look similarHow do we combine these?
left_join()
We can use the first data_frame
as the template
left_join(transport, contacts, by = "name")
name
entries missing from the second object (contacts
) will become a missing value in the column contact
name
entries in the first which have multiple matches in the second will be expandedright_join()
We can use the second object as the template
right_join(transport, contacts, by = "name")
Any missing from the first object (data
) will become a row of missing values where missing
full_join()
Or we can combine everything
full_join(transport, contacts, by = "name")
Will be missing values in both directions
wide
formIn long
form, the information is structured around the measurement
wideData <- read_csv("data/wide.csv") wideData
This is a time course:
Tx
) A & BThe basic function is gather()
tidyr
key
: what are the column names describingvalue
: what is the value in these columnsgather
using dplyr
helper functionswideData %>% gather(key = "Day", value = "Change", contains("day"))
Many functions require data to be in this format
How could we get means for each treatment/day from the original data?
How could we get means for each treatment/day from the original data?
wideData %>% gather(key = "Day", value = "Change", contains("day")) %>% group_by(Tx, Day) %>% summarise(mn_change = mean(Change))
Might be easier if needing more summaries (e.g. sd()
)
Let's save that last summary data.frame
wideMeans <- wideData %>% gather(key = "Day", value = "Change", contains("day")) %>% group_by(Tx, Day) %>% summarise(mn_change = mean(Change))
We can change from long to wide using the 'opposite' function spread()
key =
column become the new column namesvalue =
column remain the valueswideMeans %>% spread(key = Day, value = mn_change)
Would we ever use both gather()
and spread()
?
pcr <- read_csv("data/PCR.csv") head(pcr)
Here we have 3 genes being measured in 2 cell types, across 3 time-points
The first part is easy:
pcr %>% gather(key = CellType, value = Ct, ends_with("hr"))
If we want to now organise by treatment and timepoint:
CellType
into two columnsWe can use the function separate()
pcr %>% gather(key = CellType, value = Ct, ends_with("hr")) %>% separate(col = CellType, into = c("CellType", "TimePoint"))
Now we can form our final data structure
pcr %>% gather(key = CellType, value = Ct, ends_with("hr")) %>% separate(col = CellType, into = c("CellType", "TimePoint")) %>% spread(key = TimePoint, value = Ct)