27 November 2016
Also known as data munging
We'll cover:
data.frame objects (or tibbles)SQL- and Excel-like functions in dplyrreshape2fastq librariesUsing_dplyr.Rlibrary(readr) library(dplyr) library(tibble)
dplyr has functions to work specifically with data.frame objectslocal data frame aka tbl_df objectstibble objects==!=|<<=x <- 1:10 x == 5 x != 5 x > 5 x > 5 | x == 2
R also recognises the symbol & for ANDdplyrTRUE in the previous testswhich(x >5)
dplyrData for this session:
data <- read_csv("data/comments.csv", comment = "#")
data
So we have a 32 x 6 data frame
dim(data) nrow(data) ncol(data)
dplyr select() functionThe first row is the row names from the original file.
How can we remove this column?
(We could use data[,2:5])
dplyr select() functionThe first row is the row names from the original file.
How can we remove this column?
The function select() allows you to select columns by name
select(data, gender, name, weight, height, transport)
dplyr select() functionThe first row is the row names from the original file.
How can we remove this column?
The function select() allows you to select columns by name (good)
select(data, gender, name, weight, height, transport)
dplyr select() functionOr by position (not so good)
select(data, 2:6)
Discuss: What are the merits of calling by name or position?
dplyr select() function-) signselect(data, -X1)
select(data, -transport)
select() functionThe select() function has a few bonus functions:
starts_with(), ends_with(), contains(), one_of() and everything()select(data, ends_with("t"))
select(data, contains("eig"))
select() functionSo far, we haven't changed the original object
We can overwrite this anytime (sometimes by accident)
data <- select(data, -X1) data
Now we have removed the meaningless columns
To get the column back, we just need to reload the .csv file using our command we've saved in the script
filter() and arrange()We can use our logical tests to filter the data
filter(data, transport == "car") filter(data, transport == "car", gender == "female")
We can sort on one or more columns
arrange(data, weight) arrange(data, desc(weight)) arrange(data, transport, height)
dplyr steps up a gear%>%| in the bash shellmagrittr
data %>% filter(transport == "bike")
And then sort based on weight
data %>% filter(transport == "bike") %>% arrange(weight)
There is no limit to the number of functions you can chain together
Each function in dplyr takes a data.frame as the first argument
The magrittr pipes our (modified) data.frame into the next function as the first argument
?select
We can add extra columns using mutate()
data %>% mutate(height_m = height/100)
Once we've added a column, we can refer to it by name
data %>% mutate(height_m = height/100, BMI = weight / height_m^2)
We can also overwrite existing columns
data %>% mutate(height = height/100)
Have we changed the original data.frame?
Can use the function rename()
data %>% rename(height_cm = height)
Now we can get crazy
data %>%
rename(height_cm = height) %>%
mutate(height_m = height_cm/100,
BMI = weight / height_m^2) %>%
filter(BMI > 25)
Again, this is where dplyr really makes it easy.
data %>% summarise(mean(weight), mean(height))
data %>% summarise_each(funs(mean, sd), ends_with("ght"))
We can group categorical variables by their levels
data %>%
group_by(gender) %>%
summarise_each(funs(mean), ends_with("ght"))
Or combinations of levels
data %>%
group_by(gender, transport) %>%
summarise_each(funs(mean), ends_with("ght"))
We can use any function that spits out a single value
sd(), min(), median()n()data %>%
group_by(gender, transport) %>%
summarise(mn_weight = mean(weight),
mn_height = mean(height),
n = n())
This dataset is in what we refer to as wide form
long form, the information is structured around the measurementlibrary(reshape2)
wideData <- read_csv("data/wide.csv")
This is a time course for two treatments
melt(wideData, id.vars = c("Name", "Tx"))
Many functions require data to be in this format
We don't need to leave those names as variable and value
wideData %>%
melt(id.vars = c("Name", "Tx"),
variable.name = "Day",
value.name = "Change")
1 How could we get means for each treatment/day from the original data?
wideData %>%
group_by(Tx) %>%
summarise_each(funs(mean), starts_with("day"))
2 How can we get the same from the data after "melting"?
wideData %>%
melt(id.vars = c("Name", "Tx"),
variable.name = "Day",
value.name = "Change") %>%
group_by(Tx, Day) %>%
summarise(mn_change = mean(Change))
Let's save that last summary data.frame
wideMeans <- wideData %>%
melt(id.vars = c("Name", "Tx"),
variable.name = "Day",
value.name = "Change") %>%
group_by(Tx, Day) %>%
summarise(mn_change = mean(Change))
We can change from long to wide using the formula syntax
~" stands for is a function of, or depends ondcast uses it to define rows on the LHS and columns on the RHSdcast(wideMeans, Tx~Day) dcast(wideMeans, Day~Tx)
Would we ever use both?
pcr <- read_csv("data/PCR.csv")
Here we have 3 genes being measured in 2 cell types, across 3 time-points
The first part is easy:
melt(pcr, id.vars = "Gene", variable.name = "CellType", value.name = "Ct")