In our previous session, we introduced:
In this session, we’ll look at working with text strings.
A very convenient package for working with text in R is called stringr
, however, many of you may not even know what a package is. Put simply, a package is a collection of functions that can be used together to perform similar or related tasks. We make these functions available within an R session by executing library(myPackageName)
in the Console.
When you open an R session, a handful of packages are loaded by default, such as stats, graphics, grDevices, utils, datasets, methods and base. Each of these packages has a set of related functions. For example, sd()
, var()
, cor()
and rnorm()
all live in the stats
package as they are about distributional parameters or statistical model-fitting. Alternatively, abs()
, max()
, log()
and other functions live in base
are they are the fundamental functions required for performing many calculations. If you go to the help page for any of these, you’ll see the package name at the top left of each help page in the curly braces.
At last count, there are >17,00 packages available, and loading them all every time we start R
would take hours, so for each analysis, we just need to load the packages that are relevant to each analysis. Most packages are hosted on the CRAN
package repository, although we’ve installed most of the packages you’ll need. For this section, we’re going to manipulate text and the package we’ll look at is stringr
. As we mentioned, we’ve already placed this on your VM.
Making sure you’re in the appropriate R Project (~/transcriptomics/practical_1
), create a new R Markdown file called TextManipulation.Rmd
. Throughout this section, please write copious notes to yourself about what we are doing. We will not give you exact guidance on these notes, but remember that you are creating a reference document for yourself so you can find your way around later. Take the time to write everything in a way that you find clear. It’s far better to go slowly and understand everything, than to finish but to have no idea what you’ve done.
Once you’ve got your new R Markdown document, make sure you’re happy with the YAML header, then delete everything after the first chunk (setup
). Compile just to check you haven’t broken anything. When creating an R Markdown document, it can be good practice to compile regularly just to check that you haven’t done anything silly.
A good tip is to load all packages you need for a workflow at the beginning of the R Markdown document. This helps identify any potential conflicts between packages early, as sometimes two packages don’t play well together, and may even have identically named functions.
Once you have your new R Markdown document ready to go, start a new chunk and set the label to be load_packages
or something similar. Inside that chunk add the code
library(stringr)
When executed, this will load the package stringr
, which has a raft of functions for manipulating text. Often we’ll need to know these tricks as our sample names may have multiple genotypes or treatments embedded in them, and knowing how to get these pieces of information out can be very helpful. Other times, we need to know how to correct typos longs the lines of sex being recorded as male
, female
, M
, m
, f
or F
all in the same document. This type of thing happens more than you may realise, especially when dealing with large datasets that have a lot of manually entered values. Alternatively we may need to remove a file suffix to make our plots look nicer.
The first thing we’ll need to do to get underway, is create a character vector that we’ll work with for this session. Create new chunk called hi
and create the vector hi
as follows.
hi <- c("Hi Mum", "Hi Mother")
This is a simple character vector (of length 2) with two closely related phrases. This will be short enough for us to work with today and see all of the results clearly. In a real world context, we often work with vectors with many thousands of entries. These are generally far too long to inspect one element at a time.
The package stringr
has a whole series of functions, many of which start with str_
. From here on, we strongly advise to demonstrate each function within it’s own chunk, and for you to make notes in your R Markdown document that help you understand what we’re doing. For nearly all of the following examples, we won’t actually create a new object, but will just demonstrate the output of each function, which should appear as R output when you compile your document. Importantly, for the first set of functions we look at, we pass the function a character vector of length 2, and we will return a character vector of length 2. There are a few exceptions later on though.
We can very easily change the case of our vector, and this may be of help when our collaborators have given us same strange combination of M/m/F/f encoding for the sex of our samples. The main functions are str_to_lower()
, str_to_upper()
and str_to_title
, with final helpful function being str_to_sentence()
. All of these take one main argument, called string
in the function. When we call a function in R, we can name these arguments explicitly, or if we pass everything in order we don’t need to name these. The first two lines of code below are essentially identical.
str_to_lower(string = hi)
str_to_lower(hi)
Notice that because the first argument that str_to_lower()
requires is called string, we can name it, or just pass our character vector to the function in the first position. For the remaining three functions in this set, we can omit the argument name for convenience.
str_to_upper(hi)
str_to_title(hi)
str_to_sentence(hi)
These are quite well named functions as the name makes it clear what the output will be. This is not always the case with functions, so make sure you describe what each one does.
Often we have to remove unwanted text like file suffixes and directory names, and str_remove_all()
makes this easy. There are two arguments we need to provide to this function. The first is called string
, whilst the second argument is called pattern
. Now, we need to specify two arguments. We can either name them explicitly, or just pass all of the values in the correct position.
str_remove_all(string = hi, pattern = "M")
str_remove_all(hi, "M")
As well as specifying exact patterns to match, we can provide a set of alternative values. In the following, we’ll remove either an upper or lower case m
by providing a set of alternative characters within the square brackets. To demonstrate providing a larger set of alternative characters, we’ll follow that by removing all lower-case vowels.
str_remove_all(hi, "[Mm]")
str_remove_all(hi, "[aeiou]")
In stringr
, syntax is based on regular expressions, so we can also set a wild-card character by using the .
symbol. For those familiar with regular expressions, this will be old news, but for those who are not familiar with regexp
syntax this may be a new way of working with text. Some of you may also be familiar with using the *
symbol as a plain-text wild-card, but importantly the *
symbol doesn’t work in that manner here and actually has a completely different meaning.
str_remove_all(hi, "Mu")
str_remove_all(hi, "M.")
In this last example, we’ve removed the upper case M
and any single character that directly follows it. We could also remove any number of following characters by adding a +
symbol after the .
, which is interpreted as match anything (.) one or more times (+)
str_remove_all(hi, "M.+")
As well as just removing sections of a text string,there are various methods for reaching into a text string and grabbing the patterns we need. The first of these is str_extract()
which also takes the two arguments string
and pattern
. If we search for a pattern that doesn’t exist, str_extract()
will return a missing value (NA
).
str_extract(hi, "Mu")
str_extract(hi, "M.+")
Notice how the patterns behave in the exact same manner as we saw with str_replace_all()
, by using the regex
syntax. We can also use exact positions within a text string using str_sub()
, with the arguments string
, start
and end
. Notice that naming the start
and end
arguments explicitly is really helpful here as when you read this back in 2 years time, you’ll immediately understand what the numbers 4
and 5
represent.
str_sub(hi, start = 4, end = 5)
Sometimes, we don’t need to just find or extract one simple pattern, but we need to perform more complex manipulations. The function str_replace()
allows us to specify the arguments string
& pattern
as we’ve already seen, but also provide the third argument replacement
. Once again, the following two commands are identical, because we are providing the arguments in order.
str_replace(string = hi, pattern = "Hi", replacement = "Hello")
str_replace(hi, "Hi", "Hello")
We can also use our pattern matching tricks to replace either “Mum” or “Mother” with “Dad”. Feel free to experiment with other patterns and replacements here.
str_replace(hi, "M.+", "Dad")
As well as just using wild-cards, we can specify complete patterns as alternatives. When searching using this technique, the alternative phrases must be provided within the round braces, and they are separated by the “|” symbol, which is commonly used to represent “OR”.
str_replace(hi, "(Mum|Mother)", "Dad")
str_replace()
also allows us to ‘capture’ a pattern within each text string, and incorporate it into the replacement. To capture a pattern, we surround it in the round braces ()
, as we did above. To return the pattern, we use the shortcut \\1
which will return the first pattern we have captured. In the following, notice that we’re matching white-space followed by “(M.+)”, where the captured pattern is “M.+”. In the replacement, we’re replacing the entire pattern (including the white-space) with a long string, but this replacement string includes the captured pattern.
str_replace(hi, " (M.+)", "! We captured and replaced the pattern \\1.")
We can also capture multiple patterns, and return them in any order we choose. In the following, our first capture will be “Hi”, then we don’t capture the white-space, but then we capture the pattern “M.+” as we did before. In our replacement, we’re returning them in the opposite order that we captured them in, and we’re also including some exclamation marks and additional white-space.
str_replace(hi, "(Hi) (M.+)", "\\2! \\1!")
Just like str_remove()
and str_remove_all()
, there are two versions of str_replace()
. The first will only replace the first match to the pattern, whilst str_replace_all()
will replace all matches to the pattern. Depending on what operation you’re needing to perform, either may be suitable.
str_replace(hi, "[Mm]", "b")
str_replace_all(hi, "[Mm]", "b")
The above functions are probably the “big guns” of stringr
, however there are a few more very useful things we can do to clean up our data. If we wish to ensure that our text strings are all the same length, we can pad them with any character we choose, using str_pad()
. This takes the arguments string
, width
, side
and pad
as the four arguments. In the first example, we’re setting our new strings to be exactly 10 characters wide, and we’re padding on the right with exclamation marks. Note again, that explicitly naming the arguments here helps us understand exactly what we’re doing with the function.
str_pad(hi, width = 10, side = "right", pad = "!")
Another common use for something like this would be to pad numbers to ensure we have the same number of digits. If numbers are represented as characters, most sorting algorithms sort them alpha-numerically instead of numerically. We often see strange ordering like 1, 10, 2, 3, 4, 5, 6, 7, 8 and 9 instead of the more reasonable 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10. In the next line, we’ll add zeroes in front of the numbers to ensure every number has two digits. Once we’ve done this, alpha-numeric sorting and numeric sorting will give the same results.
str_pad(1:10, width = 2, pad = "0")
Notice that we didn’t set the argument side
here! By default, this argument will be set to side = "left"
, and we’ve relied on this value. This means that pad = "0"
will be provided in position 3 of the function, so must be explicitly named. This also demonstrates why it can be helpful to explicitly name arguments.
Sometimes we can end up with ridiculously long text strings, particularly when dealing with gene-set enrichment analyses, so we can use str_trunc()
to truncate these. By default, the last 3 characters of a truncated string will be given as the ellipsis (...
) to indicate the text has been shortened.
str_trunc(hi, width = 6)
Finally, we often see text strings which have excessive white-space, so we can remove this using str_trim()
. This will remove leading or trailing white-spaces from the text.
str_trim(" Hi!")
In all of the above, we provided a character vector of length 2 and we obtained a character vector of length 2 as our output. (NA
technically still counts as a character.) There are multiple other functions which exist that return output in different forms
The first of these might be to perform a logical test for the presence of a pattern within our text strings. str_detect()
again takes the two arguments string
and pattern
, with a third argument called negate
that defaults to FALSE
. If we set this to TRUE
, we will invert our results, which is often referred to as negating a logical search. In both cases, we no longer return a character vector, but return the logical values TRUE
and FALSE
.
str_detect(hi, "Mum")
str_detect(hi, "Mum", negate = TRUE)
These functions really come into their own when we’re dealing with spreadsheet-like structures (known as data frames) and we wish to perform some kind of filtering on our values, as we would with Auto-Filter in Excel.
Sometimes it can be convenient to look in specific positions within our text and the two functions str_starts()
and str_ends()
check to see if a string starts or ends with a specific pattern. Both of these also take the argument negate
, which we’ll leave as the default (negate = FALSE
).
str_starts(hi,"Hi")
str_ends(hi, "m")
We can also subset our initial character vector, so that only the elements which match our search pattern are returned. Here, we provide a character vector of length 2 and initially return a character vector of length 1. By incorporating regex
wildcards, we can then return any elements that contains “M.”.
str_subset(hi, "Mum")
str_subset(hi, "M.")
A final function (str_view_all()
) returns a completely different output. This matches the specified pattern and opens the results in the Viewer pane, which we haven’t used until now. This can be incredibly helpful if you’re trying to build a complex search pattern and want to check where it’s matching.
str_view_all(hi, "M.+")
Sometimes, it can be helpful to obtain information about the content of our text strings, such as their overall length, or the number of times a specific letter or pattern appear. I’m sure you’ll immediately realise this can be very useful if wanting to assess the nucleotide content of a DNA/RNA sequence. In NextGen Sequencing experiments, the GC content of a sequence is a known factor that biases our ability to sequence a DNA fragment. In all of these, we’ll provide our input as a character vector with length 2, but will obtain an integer vector of the same length as the output.
str_length(hi)
str_count(hi, "[Hh]")
str_count("AGCTGCGCGATTTAGC", "[GC]")
Sometimes, we need to condense multiple text strings into a single string, and we can use str_flatten
for this.
str_flatten(hi, collapse = ", ")
However, there are two base
functions paste()
and paste0()
that are just as helpful here, and probably a little more flexible.
paste(hi, collapse = ", ")
In addition to performing the above concatenation, we more commonly use the argument sep
within the function paste()
to join additional text to a string. In the following, we’re pasting an additional phrase to each of our initial vector elements, and specifying the separator between the phrases to be ", "
.
paste(hi, "I hope you're well.", sep = ", ")
The function paste0()
is a simple wrapper to paste()
but with sep = ""
as the default.
paste0(hi, "!")
Finally, a common operation is to separate a single text string into multiple text strings based on a specific character, or set of characters. The function that gives us the most convenient output for this is str_split_fixed()
, where we provide the arguments string
, pattern
and n
, with the final argument controlling how many strings to obtain from the initial string. In the following, we’re splitting our text based on white-space, and ensuring that each string is only split into 2 new strings. This provides output in a structure known as a matrix
, which we’ll explore further in the course.
str_split_fixed(hi, " ", 2)
In general, when performing this type of operation, we’re looking for something specific and we’ll know what type of output we need. There is a slightly more flexible function str_split()
, however the output format from str_split()
is beyond the scope of our knowledge for now.