20 July 2016
Also known as data munging
We'll cover:
SQL- and Excel-like functions in dplyrreshape2stringrlibrary(readr) library(dplyr) library(tibble)
dplyr works specifically with data.frame objectslocal data frame aka tbl_df objectstibble objects==!=|<<=x <- 1:10 x == 5 x !=5 x > 5 x > 5 | x == 2
R also recognises the symbol & for ANDdplyrdplyrData for this session:
data <- read_csv("data/comments.csv", comment = "#")
data
So we have a 32 x 6 data frame
dim(data) nrow(data) ncol(data)
dplyr select() functionThe first row is the row names from the original file.
How can we remove this column?
The function select() allows you to select columns by name
select(data, gender, name, weight, height, transport)
dplyr select() functionThe first row is the row names from the original file.
How can we remove this column?
The function select() allows you to select columns by name
select(data, gender, name, weight, height, transport)
Or by position
select(data, 2:6)
dplyr select() function-) signselect(data, -1)
select(data, -transport)
Discuss: Would removing by name or position by the best?
select() functionThe select() function has a few bonus functions:
starts_with(), ends_with(), contains(), one_of() and everything()select(data, ends_with("t"))
select(data, contains("eig"))
select() functionSo far, we haven't changed the original object
We can overwrite this anytime (sometimes by accident)
data <- select(data, -1) data
Now we have removed the meaningless columns
.csv filefilter() and arrange()We can use our logical tests to filter the data
filter(data, transport == "car") filter(data, transport == "car", gender == "female")
We can sort on one or more columns
arrange(data, weight) arrange(data, desc(weight)) arrange(data, transport, height)
dplyr steps up a gear%>%| in the bash shellmagrittrdata %>% filter(transport == "bike")
data %>% filter(transport == "bike") %>% arrange(weight)
There is no limit to the number of functions you can chain together
Each function in dplyr takes a data.frame as the first argument
The magrittr pipes our (modified) data.frame into the next function as the first argument
We can add extra columns using mutate()
data %>% mutate(height_m = height/100)
Once we've added a column, we can refer to it by name
data %>% mutate(height_m = height/100, BMI = weight / height_m^2)
We can also overwrite existing columns
data %>% mutate(height = height/100)
Have we changed the original data.frame?
Can use the function rename()
data %>% rename(height_cm = height)
Now we can get crazy
data %>%
rename(height_cm = height) %>%
mutate(height_m = height_cm/100,
BMI = weight / height_m^2) %>%
filter(BMI > 25)
Again, this is where dplyr really makes it easy.
data %>% summarise(mean(weight), mean(height))
data %>% summarise_each(funs(mean, sd), ends_with("ght"))
We can group categorical variables by their levels
data %>%
group_by(gender) %>%
summarise_each(funs(mean), ends_with("ght"))
Or combinations of levels
data %>%
group_by(gender, transport) %>%
summarise_each(funs(mean), ends_with("ght"))
We can use any function that spits out a single value
sd(), min(), median()n()data %>%
group_by(gender, transport) %>%
summarise(mn_weight = mean(weight),
mn_height = mean(height),
n = n())
This dataset is in what we refer to as wide form
long form, the information is structured around the measurementlibrary(reshape2) library(readr)
wideData <- read_csv("data/wide.csv")
This is a time course for two treatments
melt(wideData, id.vars = c("Name", "Tx"))
Many functions require data to be in this format
We don't need to leave those names as variable and value
wideData %>%
melt(id.vars = c("Name", "Tx"),
variable.name = "Day",
value.name = "Change")
1 How could we get means for each treatment/day from the original data?
wideData %>%
group_by(Tx) %>%
summarise_each(funs(mean), starts_with("day"))
2 How can we get the same from the data after "melting"?
wideData %>%
melt(id.vars = c("Name", "Tx"),
variable.name = "Day",
value.name = "Change") %>%
group_by(Tx, Day) %>%
summarise(mn_change = mean(Change))
Let's save that last summary data.frame
wideMeans <- wideData %>%
melt(id.vars = c("Name", "Tx"),
variable.name = "Day",
value.name = "Change") %>%
group_by(Tx, Day) %>%
summarise(mn_change = mean(Change))
We can change from long to wide using the formula syntax
~" stands for is a function of, or depends ondcast uses it to define rows on the LHS and columns on the RHSdcast(wideMeans, Tx~Day) dcast(wideMeans, Day~Tx)
Would we ever use both?
pcr <- read_csv("data/PCR.csv")
Here we have 3 genes being measured in 2 cell types, across 3 time-points
The first part is easy:
melt(pcr, id.vars = "Gene", variable.name = "CellType", value.name = "Ct")
We might like to split those names into the two cell types and 3 time-points.
str_extract() and str_replace()library(stringr)
pcr %>%
melt(id.vars = "Gene",
variable.name = "CellType", value.name = "Ct") %>%
mutate(TimePoint = str_extract(CellType, "(0hr|12hr|24hr)"),
CellType = str_replace(CellType, "_(0hr|12hr|24hr)", ""))
Save this as the object pcr_long
Now we could make it wide (but not as wide)
pcr %>%
melt(id.vars = "Gene",
variable.name = "CellType", value.name = "Ct") %>%
mutate(TimePoint = str_extract(CellType, "(0hr|12hr|24hr)"),
CellType = str_replace(CellType, "_(0hr|12hr|24hr)", "")) %>%
dcast(Gene + CellType ~ TimePoint, value.var = "Ct")
To investigate the effect of fitness, gender, country of origin on red blood cell count (RBC), you have been given challenge.csv with observations made on 500 subjects. The variables are:
Clean the data entirely in R