This is also known as data munging or data wrangling and is one of the most common processes in any bioinformatics or statistical analysis.
We’ll cover:

SQL- and Excel-like functions in dplyr:
- select(): selecting and rearranging columns
- rename(): renaming columns
- mutate(): creating and overwriting columns
- arrange(): sorting rows
- summarise(): obtaining summary values for a column
- group_by(): defining grouping variables for summaries

tidyr:
- gather(): changing from wide to long form
- spread(): changing from long to wide form

stringr:
- str_extract_all(): using regular expressions to extract patterns within strings

The datasets required are contained in the files 1) transport.csv and 2) pcr.csv.
Create a new R Script called DataWrangling.R
dplyr
This package is also loaded by default when we enter library(tidyverse). For this session we’ll use the same data as for the previous session. Note that this time we’ll define the file path as an R object and then pass it to read_csv(). To check that we have this correct, we’ll first check that the file exists, using the appropriately named function file.exists().
library(tidyverse)
csvFile <- file.path("~", "data", "intro_r", "transport.csv")
file.exists(csvFile)
data <- read_csv(csvFile)
data
So we have a data frame, which has the familiar structure of rows and columns, and may look a little like a spreadsheet in Excel. We can check the dimensions of this object.
dim(data)
nrow(data)
ncol(data)
Our tasks will be to tidy up the columns of this data, then filter, sort and summarise it.
data.frame
The conventional R
method of subsetting a data.frame
is to use the square brackets ([]
) and specify rows/columns by position or by name.
For example, to get back the first value of the second column we could type:
data[1, 2]
Or we could get back the first 5 names
data[1:5, "name"]
The package dplyr
has an additional method of selecting columns, by using the function select()
.
select(data, name, transport)
Note that in the above, we have only printed the results to the screen and haven’t created any new R
objects.
An advantage of this method, as opposed to the square brackets, is the use of some helper functions: starts_with()
, ends_with()
, contains()
and everything()
. We can use these to reorder the columns easily. Try the following, and you’ll see how useful this can be.
select(data, name, everything())
select(data, ends_with("ght"))
select(data, contains("a"))
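The remaining helper, starts_with(), works the same way; for example, to select the height column already in our data:
select(data, starts_with("he"))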
We can also use the -
sign before a name to remove columns.
select(data, -name)
It looks like the first column (X1) is actually just the rownames written to the file when saved by our collaborators. Let’s remove that column, and this time we’ll overwrite the object. If you haven’t been keeping track of the above in your script, this would be an important line to put there, as we are overwriting the original object. If we delete a column and then realise we need it back, we can just rerun our code to reload the object.
data <- select(data, -X1)
This is an alternative to skipping the column when importing with read_csv(). In the function select(), we can also use column numbers, but in general it is best practice to use column names.
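For reference, a minimal sketch of that import-time alternative, using readr’s col_skip() and assuming the unwanted column is named X1 as above:
data <- read_csv(csvFile, col_types = cols(X1 = col_skip()))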
We can add extra columns using mutate()
. It’s clear that the height here is provided in cm, but for BMI calculations we would need height in m. We can add a column and perform the calculation at the same time.
mutate(data, height_m = height/100)
We can also perform multiple column additions in the same step.
mutate(data, height_m = height/100, BMI = weight / height_m^2)
Note that we haven’t overwritten our original object, so let’s add those columns permanently to our data.frame
. Don’t forget to add this line to your script! Add a comment explaining what you’ve done if you like; it will probably help when you read your code back later.
data <- mutate(data, height_m = height/100, BMI = weight / height_m^2)
Now we have a column called height
and another called height_m
so it might be sensible to rename our original column as height_cm
to make this clear. The function rename()
is all we need to perform this action. This time we’ll just overwrite the new version of data
straight away.
data <- rename(data, height_cm = height)
data
Many of us are familiar with the Auto-filter in Excel, and dplyr
has an equivalent function, called filter()
. We can use logical tests to build up complex filtering criteria. Again, note that in the next few lines, we are not overwriting our object but are just exploring our data.
filter(data, gender == "male")
Notice that in the above line we used the double equals (==
) sign. This is common syntax in most programming languages, whilst a single equals sign (=
) usually means we are assigning a value to an object or variable. To test for inequality (not equal to), we replace the first equals sign with an exclamation mark (!
).
filter(data, gender != "male")
We can build up complex filters by adding them inside the filter function with a comma between them.
filter(data, gender == "male", height_cm > 175)
filter(data, transport == "car", gender == "female")
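Each comma-separated condition must be true, i.e. the comma acts as a logical AND. For a logical OR, we can use the | operator inside a single condition. A quick sketch using the transport values we’ve seen:
filter(data, transport == "car" | transport == "bike")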
We can also use the function arrange()
to sort our data, just like the sort function in Excel. By default, values are sorted in ascending order, so to reverse this we just wrap the column name in desc()
.
arrange(data, weight)
arrange(data, desc(weight))
We can also sort on multiple columns in the same line.
arrange(data, transport, height_cm)
%>%
This is where dplyr
steps up a gear. We can chain functions together using the symbol %>%
, which behaves like the pipe symbol (|) in the bash shell. This symbol is called “The Magrittr” or “The Pipe”, after a famous painting by René Magritte. It is contained in the package magrittr, and is loaded automatically every time we load dplyr.
The %>%
symbol places whatever precedes it as the first argument of the next function. An alternative way to write our previous line of code would be:
data %>% arrange(transport, height_cm)
We can now use this to combine functions into a longer chain of commands, and there is no limit to the number of functions you can chain together.
data %>% filter(transport == "bike") %>% arrange(weight)
This specifically works here because every function takes a data.frame
as input, and provides a data.frame
as output. From this point forward, we’ll use this as our standard syntax when using dplyr
functions.
As we’re now able to build up long chains of commands, we can also spread our code over multiple lines. Ending a line with the %>% symbol lets R know that another line is coming. This has the dual advantage of being easier to read later, and enabling us to write comments at the end of each line.
data %>% # Take our original dataset
filter(transport == "bike") %>% # Find the cyclists
arrange(weight) # Arrange by weight
We can get summaries for entire columns:
data %>% summarise(mean(weight), mean(height_cm))
Or we could chain together a few commands, and provide new names for our summary columns.
data %>%
filter(gender == "female",
transport == "bike") %>%
summarise(max_BMI = max(BMI),
mn_height = mean(height_cm))
We can use any function here that returns a single value, such as min()
, max()
, mean()
, sd()
, median()
etc.
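For example, a minimal sketch using two of these on our data:
data %>% summarise(min_weight = min(weight), sd_height = sd(height_cm))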
As an additional, very useful feature, we can group categorical variables by their levels, and even count how many measurements we have.
data %>%
group_by(gender, transport) %>%
summarise(mn_weight = mean(weight),
mn_height = mean(height_cm),
mn_BMI = mean(BMI),
n = n())
As you can see, this is very similar to Excel, except we don’t have to repeat anything if new data arrives. We just load the new file and rerun our code. And we can see exactly what we’ve done, without accidentally overwriting our original data. It’s all happened in R
while our original file is still unchanged on our hard-drive. This can aid with reproducible research, and also help prevent catastrophes.
In this section we will need the pcr.csv
dataset in the data folder. Let’s load the data first by placing this at the beginning of your new script file.
library(tidyverse)
pcrFile <- file.path("~", "data", "intro_r", "pcr.csv")
file.exists(pcrFile)
pcrData <- read_csv(pcrFile)
Here we have RT-PCR data from T cells as a time course for two treatments (Resting and Stim). The values are Ct
values and we have three genes under investigation.
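As before, we can simply print the object to take a quick look at its structure:
pcrData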
This dataset is in what we refer to as wide
form where we have a row of measurements for each individual gene. The information is structured around the gene. In long
form, the information would be structured around the measurement.
To perform this in R
we can simply use the package tidyr
which is loaded with the call to library(tidyverse)
. The function we’ll use is gather()
, which gathers multiple columns into a single column.
pcrData %>% gather(key = "variable", value = "value", -Gene)
Notice that this has held the Gene
column fixed but the remainder of the columns have been ‘gathered’ down the screen. The column names are in a new column called variable
, and the values are in a column called value
. Now you can see that we have a unique row for each measurement, containing all the information about that measurement.
By adding the final -Gene
we are telling gather to ignore the Gene
column. An alternative method would be to use ends_with("hr")
. Try it if you are interested.
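If you’d like to check your answer, that alternative would look like the following, assuming all of the measurement columns end in “hr”:
pcrData %>% gather(key = "variable", value = "value", ends_with("hr"))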
Let’s save that as a new object, and give those columns better names while we’re at it.
pcrLong <- pcrData %>% gather(key = "Treatment", value = "Ct", -Gene)
head(pcrLong)
The important column here is called Treatment
and contains data that should be placed in two columns. We can separate this column at the _
using the function separate()
.
pcrLong %>%
separate(Treatment, into = c("Treatment", "Timepoint"), sep = "_")
If we wanted to return this to wide form, we can use the function spread()
.
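A minimal sketch, using the pcrLong object from above to reverse the gather() step:
pcrLong %>% spread(key = Treatment, value = Ct)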