1.2.3: Working With Text

16 April 2019

Text Manipulation

The package stringr contains functions for text manipulation
- Another core package from tidyverse
Some basic knowledge of regular expressions is helpful
Functions str_detect(), str_extract(), str_replace()
Alternatives to grepl(), grep(), gsub() etc from base
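
As a rough guide to how these relate to base R, the sketch below runs each stringr function next to its closest base equivalent. The vector x is only an illustration. Note that stringr takes the string first and the pattern second, whereas the base functions take the pattern first.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother")

# Detecting a pattern: one logical value per string
detect_stringr <- str_detect(x, "Mum")
detect_base    <- grepl("Mum", x)

# Replacing the first match in each string
first_stringr <- str_replace(x, "M", "m")
first_base    <- sub("M", "m", x)

# Replacing every match in each string
all_stringr <- str_replace_all(x, "[Mm]", "b")
all_base    <- gsub("[Mm]", "b", x)

# Each pair should agree element by element
all(detect_stringr == detect_base)
all(first_stringr == first_base)
all(all_stringr == all_base)
```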

`stringr::str_detect()`

library(tidyverse)
x <- c("Hi Mum", "Hi Mother")

str_detect() returns a logical vector

str_detect(string = x, pattern = "Mum")
str_detect(string = x, pattern = "Hi")

`stringr::str_detect()`

We can use common regex syntax:

Alternative characters are specified inside [], which matches any one of them

str_detect(x, "h")
str_detect(x, "[Hh]")

Wild-cards are specified using .

str_detect(x, "Mo")
str_detect(x, "M.")
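
To make these two ideas concrete, here is a small sketch with the results expected for the same x (redefined so the block runs on its own). The vector y is a made-up example, added to show that a literal full stop has to be escaped as \\. because an unescaped . matches any character.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother")

# Only "Mother" contains a lower-case h
str_detect(x, "h")      # FALSE TRUE
# [Hh] matches either case, so "Hi" now matches in both strings
str_detect(x, "[Hh]")   # TRUE TRUE

# "Mo" only occurs in "Mother"
str_detect(x, "Mo")     # FALSE TRUE
# "M." is an M followed by any single character
str_detect(x, "M.")     # TRUE TRUE

# Hypothetical example: matching a literal full stop
y <- c("a.b", "acb")
str_detect(y, "a.b")    # TRUE TRUE
str_detect(y, "a\\.b")  # TRUE FALSE
```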

`stringr::str_extract()`

We can use str_extract() to extract patterns

str_extract(string = x, pattern = "Hi M.")

If no match is found in a string, str_extract() returns NA for that string

str_extract(x, "Mum")
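
A minimal sketch of what the two calls above should return, assuming the same x. The name mums is just an illustrative variable. Because the output is always the same length as the input, the NA values can be used to flag strings without a match.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother")

# Only the matched text is returned, not the whole string
str_extract(x, "Hi M.")        # "Hi Mu" "Hi Mo"

# Strings without a match give NA
mums <- str_extract(x, "Mum")  # "Mum" NA
is.na(mums)                    # FALSE TRUE
```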

`stringr::str_replace()`

Common syntax for extracting/modifying text strings

str_replace(x, pattern = "Mum", replacement = "Dad")

This searches each string in x for the pattern "Mum", and
replaces the first instance of "Mum" with "Dad"

`stringr::str_replace()`

We can specify wild-cards as .

str_replace(x, "M.", "Da")

We can also match one or more wild-cards by using +

str_replace(x, "M.+", "Dad")
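
The difference between the two patterns is easier to see with the results side by side. This is a sketch using the same x. The last line uses a made-up string to show that .+ is greedy and runs to the end of the string.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother")

# "M." matches M plus exactly one more character
str_replace(x, "M.", "Da")    # "Hi Dam" "Hi Dather"

# "M.+" matches M plus one or more characters, as many as possible
str_replace(x, "M.+", "Dad")  # "Hi Dad" "Hi Dad"

# Hypothetical example: the greedy match swallows everything after the M
str_replace("Hi Mum and Dad", "M.+", "X")  # "Hi X"
```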

`stringr::str_replace()`

We can also capture words/phrases/patterns using (pattern)

str_replace(x, "(Hi) (M.+)", "\\2! \\1!")

Patterns are numbered in the order they are "captured"
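
As a sketch of how the numbering works, str_match() returns the captured groups as columns of a character matrix, with the full match in the first column. The variable m is illustrative only.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother")

# \\1 refers to (Hi) and \\2 refers to (M.+)
str_replace(x, "(Hi) (M.+)", "\\2! \\1!")  # "Mum! Hi!" "Mother! Hi!"

# Column 1 is the whole match, columns 2 and 3 are the captured groups
m <- str_match(x, "(Hi) (M.+)")
m[, 2]  # "Hi" "Hi"
m[, 3]  # "Mum" "Mother"
```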

`stringr::str_replace()`

We can also specify alternatives instead of wild-cards ([])

str_replace(x, "[Mm]", "b")

str_replace() only replaces the first match in a string
str_replace_all() replaces all matches

str_replace_all(x, "[Mm]", "b")
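
For the same x, the two functions should differ only for "Hi Mum", which contains both an M and an m. A short sketch:

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother")

# Only the first M or m in each string is replaced
str_replace(x, "[Mm]", "b")      # "Hi bum" "Hi bother"

# Every M or m is replaced
str_replace_all(x, "[Mm]", "b")  # "Hi bub" "Hi bother"
```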

`stringr::str_replace()`

Alternative patterns can be specified using the conventional OR symbol |

str_replace(x, "(Mum|Mother)", "Maternal Parent")

More Helpful Functions

str_count(x, "[Mm]")
str_length(x)
str_split_fixed(x, pattern = " ", n = 2)
str_to_lower(x)
str_to_title("a bad example")
str_pad(c("1", "10", "100"), width = 3, pad = "0")
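
The sketch below repeats these calls with the results expected for the same x. The name parts is an illustrative variable. Note that str_pad() pads on the left by default.

```r
library(stringr)

x <- c("Hi Mum", "Hi Mother")

str_count(x, "[Mm]")              # 2 1
str_length(x)                     # 6 9

# A character matrix with one row per string and n columns
parts <- str_split_fixed(x, pattern = " ", n = 2)
parts[, 2]                        # "Mum" "Mother"

str_to_lower(x)                   # "hi mum" "hi mother"
str_to_title("a bad example")     # "A Bad Example"
str_pad(c("1", "10", "100"), width = 3, pad = "0")  # "001" "010" "100"
```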

Factors

A common data type in statistics is a categorical variable (i.e. a factor)

Each value belongs to one of a fixed set of groups

pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Cat")

This is a character vector

Factors

We can simply coerce this to a factor
By default, the categories (levels) are sorted alphabetically

pet_factors <- as.factor(pet_vec)
pet_factors

We can manually set these categories as levels

pet_factors <- factor(pet_vec, levels = c("Dog", "Cat"))
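
A quick way to check the order is levels(). The sketch below uses only base R and the same pet_vec. The names default_factor and level_counts are illustrative.

```r
pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Cat")

# Default: levels are sorted alphabetically
default_factor <- as.factor(pet_vec)
levels(default_factor)   # "Cat" "Dog"

# Manual: levels appear in the order we supply
pet_factors <- factor(pet_vec, levels = c("Dog", "Cat"))
levels(pet_factors)      # "Dog" "Cat"

# Summaries such as table() follow the level order
level_counts <- table(pet_factors)
names(level_counts)      # "Dog" "Cat"
```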

Factors

These are actually stored as integers
Each integer corresponds to a level

str(pet_factors)
as.integer(pet_factors)
as.character(pet_factors)
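
As a sketch of how the integers and levels fit together, indexing the levels by the integer codes should recover the original labels. The variable codes is illustrative.

```r
pet_vec <- c("Dog", "Dog", "Cat", "Dog", "Cat")
pet_factors <- factor(pet_vec, levels = c("Dog", "Cat"))

# Each value is stored as the position of its level
codes <- as.integer(pet_factors)
codes                        # 1 1 2 1 2

# Looking the codes up in the levels gives back the labels
levels(pet_factors)[codes]   # "Dog" "Dog" "Cat" "Dog" "Cat"
```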

Factors

What would happen if we think a factor is a character, and we use it to select values from a vector/matrix/data.frame?

names(pet_vec) <- pet_vec
pet_vec
pet_vec[pet_factors]
pet_vec[as.character(pet_factors)]
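
The problem is easier to spot when the values differ from the names. In this sketch the weights are made-up numbers and the variable names are illustrative. Subsetting with a factor uses the integer codes (positions), not the labels.

```r
# Hypothetical lookup table, named by pet type
weights <- c(Dog = 20, Cat = 4)

# Default levels are alphabetical: Cat = 1, Dog = 2
pets <- factor(c("Cat", "Dog", "Cat"))

# The factor is treated as the integers 1, 2, 1: the wrong values
by_factor <- weights[pets]
unname(by_factor)            # 20 4 20

# Converting to character selects by name: the intended values
by_name <- weights[as.character(pets)]
unname(by_name)              # 4 20 4
```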

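This is why I'm very cautious about read.csv() and the standard data.frame(), which convert character columns to factors by default (stringsAsFactors = TRUE) in R versions before 4.0.0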
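
One way to avoid surprises is to set stringsAsFactors explicitly rather than relying on the default. This sketch reads a small made-up CSV from a string, so csv_text and the column names are assumptions for illustration only.

```r
# Hypothetical data, supplied as text instead of a file
csv_text <- "pet,weight\nDog,20\nCat,4"

# Keep text columns as character vectors
pets_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.character(pets_chr$pet)   # TRUE

# Convert text columns to factors only when we ask for it
pets_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
is.factor(pets_fct$pet)      # TRUE
```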