Please start a new R Project for today called Practical_2.
File > New Project
If prompted to Save Current Workspace, choose Don't Save.
New Directory > New Project
Enter Practical_2 as the Directory name
Create Project
Although most of what we’re doing can be done in the Console, please start a new R Script for today and save this as TextManipulation.R. A good habit to get into is to write a line (or two) explaining what is happening on the following line, then have your code. Try to get into this habit of writing messages for the future version of yourself, or for other collaborators. Remember, each comment starts with a # and this is what we need to write ourselves messages. Otherwise R will try to execute what we are writing and you’ll see horrible error messages.
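As a minimal sketch of this commenting habit, the short script below places a comment above each line of code. The object name greeting is made up purely for illustration and is not part of today's data.

```r
# Store a short piece of text so we can reuse it below
greeting <- "Hi Mum"

# Count the number of characters in the text, including the space
nchar(greeting)
```

Anything after the # is ignored by R, so the comments can be as long or as informal as you find helpful.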
stringr
Last week, we learned how to use readr, tidyr, dplyr and the pipe (%>%).
As we continue our exploration of Data Wrangling, we’ll introduce a very useful package for working with any text strings (or character vectors), known as stringr. This package is also part of the core tidyverse. In the Wrangle section of the following workflow you’ll see it listed alongside dplyr and magrittr (%>%). The package lubridate is excellent for managing time & date data, whilst forcats is for categorical data; however, we won’t really explore these in this course.
Some immediately useful functions in the stringr package are str_to_lower(), str_to_upper() and str_to_title(). Let’s load last week’s pcr.csv file to have a look.
library(tidyverse)
pcrFile <- file.path("~", "data", "intro_r", "pcr.csv")
file.exists(pcrFile)
pcrData <- read_csv(pcrFile)
We can actually just grab the Gene column out from this tibble using the $ symbol followed by the column name. Note that auto-complete will be your friend here.
pcrData$Gene
Here all the genes are given as upper case, but what our collaborators failed to realise is that these are mouse genes. Only human genes follow the convention of every letter being upper-case, whilst mouse genes have only the first letter as upper-case. Here stringr is our friend and we can use the function str_to_title(). (There are no bold or italic fonts in the R Console so we can’t really distinguish genes and proteins.)
str_to_title(pcrData$Gene)
Notice how we’re able to modify a whole set of values (i.e. a column or character vector) with only one command. If we wanted to do this in Excel, we’d have to go to a blank column, call the function PROPER and point it to the first value in the column to be converted, then fill down until we have all the values in a new column.
R gives us at least two ways to replace these values. The first would be using mutate():
pcrData <- mutate(pcrData, Gene = str_to_title(Gene))
The second would be to use the $ operator to extract just the single column.
pcrData$Gene <- str_to_title(pcrData$Gene)
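To see that the two approaches are interchangeable, here is a small sketch on a toy tibble. The object toy and its values are invented for illustration and simply stand in for pcrData.

```r
library(dplyr)
library(tibble)
library(stringr)

# Toy tibble standing in for pcrData; the values are invented
toy <- tibble(Gene = c("ACTB", "GAPDH"), Ct = c(18.2, 19.5))

# Approach 1: mutate() returns a modified copy of the whole tibble
viaMutate <- mutate(toy, Gene = str_to_title(Gene))

# Approach 2: overwrite the single column using $
viaDollar <- toy
viaDollar$Gene <- str_to_title(viaDollar$Gene)

# Both approaches give the same Gene column
identical(viaMutate$Gene, viaDollar$Gene)  # TRUE
```

The mutate() form tends to read better inside a longer chain of dplyr steps, whilst the $ form is handy for a quick one-off change.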
Explore what you can do with str_to_lower() and str_to_upper() just to get a feel for them.
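If you'd like a starting point, the sketch below runs all three case functions on a small vector. The vector genes is hypothetical and deliberately mixes cases so the differences are easy to see.

```r
library(stringr)

# Hypothetical gene symbols in mixed case, made up for illustration only
genes <- c("ACTB", "gapdh", "Il2")

str_to_lower(genes)  # "actb" "gapdh" "il2"
str_to_upper(genes)  # "ACTB" "GAPDH" "IL2"
str_to_title(genes)  # "Actb" "Gapdh" "Il2"
```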
Regular Expressions are an important concept in R, bash and most common languages used for bioinformatics. Matching obvious patterns uses a simple syntax. When using str_replace(), the first argument is the original string, which is followed by the search pattern and the replacement. Here we are searching the string “Hi Mum” for the pattern “Mum”, and replacing with the string “Dad”.
str_replace(string = "Hi Mum", pattern = "Mum", replacement = "Dad")
Note that we didn’t technically need to specify the argument names as we supplied them in order. A more succinct version of the above code would be:
str_replace("Hi Mum", "Mum", "Dad")
In regular expression syntax, we specify a wild-card as . which means “match any single character”. This is quite different to many other contexts where we would use an asterisk (*) as a wild-card. Asterisks have a different meaning entirely when using regular expressions.
str_replace(string = "Hi Mum", pattern = "M..", replacement = "Dad")
We can also match one or more wild-cards by using +, so in the following we are searching for M followed by anything, one or more times.
str_replace(string = "Hi Mum", pattern = "M.+", replacement = "Dad")
str_replace(string = "Hi Mother", pattern = "M.+", replacement = "Dad")
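The sketch below explores the + quantifier a little further and contrasts it with the asterisk, which in a regular expression means "zero or more of the preceding item". The input strings are made up for illustration.

```r
library(stringr)

# "+" is greedy: it takes as many characters as it can after the M
str_replace("Hi Mum, how are you?", "M.+", "Dad")  # "Hi Dad"

# "+" needs at least one character after the M, so here nothing matches
# and the string is returned unchanged
str_replace("Hi M", "M.+", "Dad")                  # "Hi M"

# "*" allows zero characters after the M, so this pattern still matches
str_replace("Hi M", "M.*", "Dad")                  # "Hi Dad"
```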
We can also capture words/phrases/patterns using round brackets containing our target (pattern). We can then return these in any order we choose by using the double backslash symbol followed by their capture number. The double backslash is R-specific syntax and won’t apply when we move to bash in a couple of weeks.
str_replace(string = "Hi Mother", pattern = "(H.+) (M.+)", replacement = "\\2! \\1!")
Note the strategic use of spaces in the patterns to recognise and return.
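As another small sketch of capture groups, the example below reorders a name written as "Surname, Firstname". The object fullName is hypothetical and only there to illustrate the technique.

```r
library(stringr)

# Hypothetical name written as "Surname, Firstname"
fullName <- "Curie, Marie"

# Group 1 captures the surname and group 2 the first name.
# The comma and space sit outside the brackets, so they are matched
# but not captured, and we return the groups in the opposite order.
str_replace(fullName, "(.+), (.+)", "\\2 \\1")  # "Marie Curie"
```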
We can also specify strict ranges of values instead of wild-cards by placing options inside square brackets ([]) during the pattern matching. It’s also worth noticing that the function str_replace() will only replace the first instance in each string, whilst str_replace_all() will replace all instances.
str_replace("Hi Mum", "[Mm]", "b")
str_replace_all("Hi Mum", "[Mm]", "b")
str_replace_all("Hi Mum", "[aeiou]", "o")
str_replace_all("Hi Mum", "[a-z]", "o")
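Ranges work for digits and upper-case letters too. The sketch below uses a hypothetical sample label, made up for illustration, to show the difference between replacing the first match and replacing every match.

```r
library(stringr)

# Hypothetical sample label, invented for illustration
label <- "Sample_12_RepA"

# Only the first digit is replaced
str_replace(label, "[0-9]", "#")      # "Sample_#2_RepA"

# Every digit is replaced
str_replace_all(label, "[0-9]", "#")  # "Sample_##_RepA"

# Every upper-case letter is replaced
str_replace_all(label, "[A-Z]", "x")  # "xample_12_xepx"
```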
Alternative patterns can be specified using the conventional OR symbol | inside the round brackets. Think very carefully about the following substitutions to make sure you can see what’s happening.
str_replace_all("Hi Mum", "(Mum|Dad)", "Parent")
str_replace_all("Hi Dad", "(Mum|Dad)", "Parent")
str_replace_all("Hi Dad", "Hi (Mum|Dad)", "Dear Beloved Parent")
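The brackets matter because they limit how far the | reaches. As a sketch, compare the same replacement with and without them:

```r
library(stringr)

# With brackets, the alternatives are "Mum" or "Dad", each preceded by "Hi "
str_replace_all("Hi Dad", "Hi (Mum|Dad)", "Parent")  # "Parent"

# Without brackets, the alternatives are "Hi Mum" or "Dad",
# so only the "Dad" part of the string is matched and replaced
str_replace_all("Hi Dad", "Hi Mum|Dad", "Parent")    # "Hi Parent"
```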
We have only just scratched the surface of text manipulation. Many other functions that may come in handy are:
str_detect(): return a TRUE or FALSE value for each tested character string
str_count(): count the number of times a pattern appears in each string
str_extract() / str_extract_all(): extract just the given pattern from our query
str_remove() / str_remove_all(): remove the given pattern
Note that we’ve switched to a magrittr style of syntax here.
pcrData$Gene %>% str_detect("A")
pcrData$Gene %>% str_count("[A-C]")
colnames(pcrData) %>% str_extract("(Resting|Stim)")
colnames(pcrData) %>% str_remove_all("hr")
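If you don't have pcrData to hand, the same functions can be tried on a small stand-in vector. The vector cols below is hypothetical and is not necessarily what the real column names look like.

```r
library(stringr)
library(magrittr)  # provides %>%

# Hypothetical column names, invented for illustration
cols <- c("Gene", "Resting_0hr", "Stim_12hr")

cols %>% str_detect("Stim")             # FALSE FALSE TRUE
cols %>% str_count("e")                 # 2 1 0
cols %>% str_extract("(Resting|Stim)")  # NA "Resting" "Stim"
cols %>% str_remove_all("hr")           # "Gene" "Resting_0" "Stim_12"
```

Notice that str_extract() returns NA where the pattern is not found, so the result still lines up with the input.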
Whilst all of the above examples return a vector of the same length as the original (logical for str_detect(), integer for str_count() and character for the others), some other functions do not. If we wanted to split our treatments using the column names, we could use str_split() or, more conveniently, str_split_fixed(), which allows us to specify how many columns to form from the original data.
colnames(pcrData) %>% str_split_fixed("_", n = 2)
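To make the difference between the two splitting functions concrete, here is a sketch using a hypothetical vector of column names (cols is made up for illustration).

```r
library(stringr)

# Hypothetical column names, invented for illustration
cols <- c("Gene", "Resting_0hr", "Stim_12hr")

# str_split() returns a list with one element per input string,
# and each element can have a different number of pieces
asList <- str_split(cols, "_")
asList

# str_split_fixed() returns a character matrix with exactly n columns,
# filling with "" where a string has fewer than n pieces
asMatrix <- str_split_fixed(cols, "_", n = 2)
asMatrix
```

The matrix form is usually easier to work with when every value should be split into the same number of parts.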