12 Strings and regular expressions

Summary

Regular expressions (aka regex) are used to search for patterns of symbols within a list of symbols (a string).
str_detect returns TRUE if the pattern appears in the text, and FALSE if it does not.
An anchor in a regex limits the pattern matching to occur either at the beginning or end of a string.
The ^ anchor changes the pattern so that it only matches at the beginning of the string.
The $ anchor changes the pattern so that it only matches at the end of the string.
The . symbol is a wildcard. When it appears in a regex it matches any symbol in a string.
Bracket notation is a way to create a pattern that matches more than one symbol in a string. Hyphens can designate a range of possibilities, for instance, [a-d] matches either a, b, c, or d.
str_extract returns the first part of the string that matches the regex pattern if it appears in the string, and NA if it does not.
str_extract_all returns every part of the string that matches the regex pattern, and an empty character vector (character(0)) if the pattern does not appear in the string.
Inside of brackets, the ^ is a negation operator, matching any symbol that is not listed in the brackets. For instance, [1-5] matches the digits 1 through 5, while [^1-5] matches any symbol that is not one of the digits 1 through 5.
The + operator changes the pattern to match the text in the string if it appears one or more times.
The $\ast$ operator (called the Kleene star) changes the pattern to match the text in the string if it appears zero or more times.
The ? operator changes the pattern to match the text in the string if it appears zero or one time.
The {n} operator repeats the last part of the pattern $n$ times exactly.

12.1 Dealing with textual data

Often, variables in datasets take on textual values. For instance, consider the following snippet of data from the Global Historical Climatology Network–Daily (GHCN-Daily) dataset.

ghcn1 <- tribble(
  ~key, ~lat, ~lon, ~un1, ~station, ~type, ~gsnnum,
  "ACW00011604", 17.1167,  -61.7833,   10.1,    "ST JOHNS COOLIDGE FLD", NA, NA,    
  "ACW00011647", 17.1333,  -61.7833,   19.2,    "ST JOHNS", NA, NA,                  
  "AE000041196", 25.3330,   55.5170,   34.0,    "SHARJAH INTER. AIRP", "GSN",     41196,
  "AF000040930", 35.3170,   69.0170, 3366.0,    "NORTH-SALANG",        "GSN",     40930,
  "AG000060390", 36.7167,    3.2500,   24.0,    "ALGER-DAR EL BEIDA",  "GSN",     60390,
  "AG000060590", 30.5667,    2.8667,  397.0,    "EL-GOLEA",            "GSN",     60590,
  "AG000060611", 28.0500,    9.6331,  561.0,    "IN-AMENAS",           "GSN",     60611,
  "AG000060680", 22.8000,    5.4331, 1362.0,    "TAMANRASSET",         "GSN",     60680
)

Note that some of the stations are given the type value of GSN. In the GHCN-Daily station list, this flag marks stations that belong to the GCOS Surface Network, a worldwide set of surface weather stations selected under the Global Climate Observing System (GCOS) for long-term climate monitoring.

To search for only those observations with type equal to GSN, the filter function can be used.

ghcn1 |> filter(type == "GSN")

## # A tibble: 6 × 7
##   key           lat   lon   un1 station           type  gsnnum
##   <chr>       <dbl> <dbl> <dbl> <chr>             <chr>  <dbl>
## 1 AE000041196  25.3 55.5     34 SHARJAH INTER. A… GSN    41196
## 2 AF000040930  35.3 69.0   3366 NORTH-SALANG      GSN    40930
## 3 AG000060390  36.7  3.25    24 ALGER-DAR EL BEI… GSN    60390
## 4 AG000060590  30.6  2.87   397 EL-GOLEA          GSN    60590
## 5 AG000060611  28.0  9.63   561 IN-AMENAS         GSN    60611
## 6 AG000060680  22.8  5.43  1362 TAMANRASSET       GSN    60680

Now suppose that the goal is to find values that contain a given string, rather than values that equal it exactly. For instance, there are two stations with ST JOHNS in the name, but only one is exactly equal to ST JOHNS.

ghcn1 |> filter(station == "ST JOHNS")

## # A tibble: 1 × 7
##   key           lat   lon   un1 station  type  gsnnum
##   <chr>       <dbl> <dbl> <dbl> <chr>    <chr>  <dbl>
## 1 ACW00011647  17.1 -61.8  19.2 ST JOHNS <NA>      NA

To find all station names that contain ST JOHNS, a regular expression will be used.

A regular expression is an encoding of a pattern to be searched for within a string of text. These are also called regex for short.

12.2 The stringr package

In this chapter, regular expressions are used in R through the stringr package, which is loaded as part of the larger tidyverse package. This package contains several functions that test strings against the patterns described by regular expressions.

For instance, suppose that the goal is just to find any string that contains ST JOHNS. Then the str_detect function can be used with the input "ST JOHNS" to detect if these letters appear anywhere in the string. The format is slightly different than when using ==. The first argument to str_detect is the vector of string values, and the second argument is the regex.

ghcn1 |> filter(str_detect(station, "ST JOHNS"))

## # A tibble: 2 × 7
##   key           lat   lon   un1 station           type  gsnnum
##   <chr>       <dbl> <dbl> <dbl> <chr>             <chr>  <dbl>
## 1 ACW00011604  17.1 -61.8  10.1 ST JOHNS COOLIDG… <NA>      NA
## 2 ACW00011647  17.1 -61.8  19.2 ST JOHNS          <NA>      NA

Recall that filter requires an argument that is a vector of TRUE/FALSE values, one for each observation. So what str_detect returns is exactly such a vector.

example_string1 <- c(
  "ST JOHNS",
  "ST JOHNS COOLIDGE FLD",
  "VAST JOHNS",
  "PITTSBURGH",
  "SV JOHNS AREA 51",
  "ST. JOHNS",
  "ST JOHNS AND ST JOHNS"
)
example_string1 |> str_detect("ST JOHNS")

## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

The first three entries and the last entry are TRUE because the string "ST JOHNS" appears somewhere within the string. Note that the regular expression still finds "ST JOHNS" in "VAST JOHNS", since it does not care about word breaks. The last entry is true because it is looking for the pattern appearing at least once. The fact that it appears twice is immaterial.
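Because str_detect returns an ordinary logical vector, it can be used anywhere R expects one, not only inside filter. The following is a minimal sketch, assuming the stringr package is installed; the vector name stations is made up for illustration.

```r
library(stringr)

# Hypothetical vector of station names, for illustration only.
stations <- c("ST JOHNS", "VAST JOHNS", "PITTSBURGH", "ST JOHNS AND ST JOHNS")

# One TRUE/FALSE value per string.
has_st_johns <- str_detect(stations, "ST JOHNS")

# Keep only the strings that contain the pattern.
matching_stations <- stations[has_st_johns]
print(matching_stations)

# Count how many strings contain the pattern (each TRUE counts as 1).
print(sum(has_st_johns))
```

Note that the last string is counted once, even though the pattern appears in it twice.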

12.3 Anchors

What if the goal was only to find the characters "ST JOHNS" at the beginning of the string? Then the regex could be modified by adding an anchor, which requires the pattern to occur either at the beginning or at the end of the string. In particular, placing the circumflex symbol ^ at the start of the regex means the pattern will only match if it occurs at the start of the string.

example_string1 |> str_detect("^ST JOHNS")

## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE

What if the goal was to find these characters at the end of the string? Then put the dollar sign symbol $ at the end of the regex.

example_string1 |> str_detect("ST JOHNS$")

## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE

If the goal was to find it at the beginning and at the end of the string, use both the ^ and $ symbols.

example_string1 |> str_detect("^ST JOHNS$")

## [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

This formulation of ^string$ is equivalent to searching if the text exactly equals the string in question.

But wait a minute, doesn’t ST JOHNS AND ST JOHNS both begin and end with ST JOHNS? So why is it detected as FALSE?

The answer lies in what regular expressions look for. They are looking for a pattern that satisfies all the constraints laid out in the regex. That means that it is looking for a set of contiguous characters in the string with three properties.

The ^ means that the set of contiguous characters must occur at the beginning of the string.
The ST JOHNS means that the set of contiguous characters must be of length eight, and match the characters exactly.
The $ means that the set of contiguous characters must be at the end of the string.

For the regex to return TRUE, all three of these properties must hold for the same set of characters. You cannot use one part of the string to match some of the properties and another part of the string to match the other properties. All properties must be satisfied by the same set of characters within the string.
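If the goal really is to find strings that begin with ST JOHNS and also end with ST JOHNS, where different parts of the string may satisfy each condition, one possible approach is to run two separate tests and combine them with &. The sketch below assumes stringr is installed; the variable names are chosen only for illustration.

```r
library(stringr)

x <- c("ST JOHNS", "ST JOHNS COOLIDGE FLD", "ST JOHNS AND ST JOHNS")

# A single regex: one run of characters must satisfy ^, ST JOHNS, and $ together.
exact <- str_detect(x, "^ST JOHNS$")
print(exact)

# Two separate tests: each one may be satisfied by a different part of the string.
starts_and_ends <- str_detect(x, "^ST JOHNS") & str_detect(x, "ST JOHNS$")
print(starts_and_ends)
```

Only the combined test reports TRUE for the last string, because its first eight characters satisfy one regex and its last eight characters satisfy the other.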

12.4 Wildcards

A wildcard character in a regex can match any one of several different symbols.

Wildcards can be very general. The most general wildcard matches any character. In a regex, the wildcard for matching any single symbol is the period (.).

Recall that the fifth entry in example_string1 is SV JOHNS AREA 51, and so does not match ST JOHNS.

example_string1 |> str_detect("ST JOHNS")

## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

Now suppose that the user does not care about the second character. By replacing it with the wildcard symbol ., it can also match the V in SV JOHNS.

example_string1 |> str_detect("S. JOHNS")

## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
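Since the period is a wildcard, a regex that is meant to find an actual period needs to say so. The following sketch shows two common ways to do this, assuming stringr is installed: placing the period inside brackets, or escaping it with a backslash (written as two backslashes inside an R string). The example vector is made up for illustration.

```r
library(stringr)

x <- c("ST JOHNS", "ST. JOHNS", "STX JOHNS")

# Unescaped: the . matches any single character, so both "." and "X" qualify.
wildcard <- str_detect(x, "ST. JOHNS")
print(wildcard)

# Inside brackets, the period is treated as an ordinary character.
literal_bracket <- str_detect(x, "ST[.] JOHNS")
print(literal_bracket)

# A backslash escape does the same job; the backslash is doubled in an R string.
literal_escape <- str_detect(x, "ST\\. JOHNS")
print(literal_escape)
```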

12.5 Brackets

What if the goal is to see if the string contains any digits 0, 1, 2, up to 9? Then a pattern for digits can be formed by using the bracket construction [0-9]. This will accept any of the values 0 through 9.

example_string1 |> str_detect("[0-9]")

## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

Here SV JOHNS AREA 51 is the only one with a digit.

This type of pattern can also be used with letters. The following finds any string that uses the letters A through C.

example_string1 |> str_detect("[A-C]")

## [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

Note this captures PITTSBURGH because it has a B in it. Here it is important to specify uppercase or lowercase letters. None of the strings have any lowercase letters, so searching for [a-c] returns all FALSE values.

example_string1 |> str_detect("[a-c]")

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

What if the goal was to match either a digit 0 through 9, or the letter B? Then both could be put into the bracket.

example_string1 |> str_detect("[B0-9]")

## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
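Ranges and individual symbols can be mixed freely inside one pair of brackets. As a small sketch (assuming stringr is installed, with a made-up example vector), the following matches letters regardless of case, and then a combination of a range and two single letters.

```r
library(stringr)

x <- c("abc", "ABC", "123", "xyz")

# Two ranges in one bracket: lowercase a through c, or uppercase A through C.
either_case <- str_detect(x, "[a-cA-C]")
print(either_case)

# A range plus individual symbols: any digit, or the letter x, or the letter z.
mixed <- str_detect(x, "[0-9xz]")
print(mixed)
```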

12.6 String extraction

So far str_detect has been used to generate just TRUE or FALSE. This is enough for use with filter, but sometimes the match itself is needed. This is the job of str_extract, which returns the text of the first match of the pattern in each string, or NA if there is no match.

example_string1 |> str_extract("[B0-9]")

## [1] NA  NA  NA  "B" "5" NA  NA

example_string1 |> str_extract("ST JOHNS")

## [1] "ST JOHNS" "ST JOHNS" "ST JOHNS" NA         NA         NA         "ST JOHNS"

Note that even though the last string ST JOHNS AND ST JOHNS contains the pattern twice, it only shows up once in the output. This is because once it finds a match, str_extract stops searching the string. To find all the occurrences of a pattern in a string, use str_extract_all.

example_string1 |> str_extract_all("ST JOHNS")

## [[1]]
## [1] "ST JOHNS"
## 
## [[2]]
## [1] "ST JOHNS"
## 
## [[3]]
## [1] "ST JOHNS"
## 
## [[4]]
## character(0)
## 
## [[5]]
## character(0)
## 
## [[6]]
## character(0)
## 
## [[7]]
## [1] "ST JOHNS" "ST JOHNS"

The output is a list with one element for each input string. Each element is a character vector holding every match found in that string, and it is empty (character(0)) when the pattern was not found.
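A list is less convenient to work with than a vector, so it often helps to summarize it. The sketch below, which assumes stringr is installed, uses the base R function lengths to count the matches in each string, and double brackets to pull out the matches for a single string.

```r
library(stringr)

x <- c("ST JOHNS", "PITTSBURGH", "ST JOHNS AND ST JOHNS")
matches <- str_extract_all(x, "ST JOHNS")

# lengths() (base R) gives the number of matches found in each string.
n_matches <- lengths(matches)
print(n_matches)

# Double brackets pull out the character vector of matches for one string.
third <- matches[[3]]
print(third)
```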

12.7 Replacing parts of a string

Earlier, it was seen that it is sometimes necessary to replace the strings matched by a pattern with other strings. The str_replace function can be used to do this. Consider conserving vowels by replacing the pattern JOHNS with JHNS. This can be done in our example strings as follows.

example_string1 |> str_replace("JOHNS", "JHNS")

## [1] "ST JHNS"              "ST JHNS COOLIDGE FLD" "VAST JHNS"           
## [4] "PITTSBURGH"           "SV JHNS AREA 51"      "ST. JHNS"            
## [7] "ST JHNS AND ST JOHNS"

In strings where the pattern is not found (like PITTSBURGH), no replacement was done. However, in the last entry only the first JOHNS was changed, and not the second. Once again, adding _all to the command gives the function str_replace_all, which replaces every occurrence of the pattern.

example_string1 |> str_replace_all("JOHNS", "JHNS")

## [1] "ST JHNS"              "ST JHNS COOLIDGE FLD" "VAST JHNS"           
## [4] "PITTSBURGH"           "SV JHNS AREA 51"      "ST. JHNS"            
## [7] "ST JHNS AND ST JHNS"

Now every occurrence has changed.
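The pattern given to str_replace and str_replace_all does not have to be literal text. Any regex can be used. As a minimal sketch (assuming stringr is installed, with a made-up example vector), the following replaces digits with the symbol #.

```r
library(stringr)

x <- c("AREA 51", "PITTSBURGH")

# Only the first digit in each string is replaced.
first_only <- str_replace(x, "[0-9]", "#")
print(first_only)

# Every digit in each string is replaced.
every_one <- str_replace_all(x, "[0-9]", "#")
print(every_one)
```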

12.8 Repeating patterns

If the goal is just to detect whether a string contains at least one digit, then str_detect(string, "[0-9]") works fine. However, when str_extract is being used, often the goal is to match a whole run of consecutive digits. The Kleene star and its variants can be used to do this.

Consider the following data, drawn from the map of Epcot at Walt Disney World for May 2022.

epcot_map <- c("Spaceship Earth1",
               "Creations Shop9",
               "Living with the Land18",
               "Turtle Talk With Crush20")

Unfortunately, there is no comma separating the first part of the name from the number attached to the key at the end. To make matters more difficult, the number of digits at the end is different for different observations.

Putting a + after a bracketed expression makes it match one or more copies of it. To see the difference, first consider a regex without the +.

epcot_map |> str_extract("[0-9]")

## [1] "1" "9" "1" "2"

Now with the + it can match one or more digits.

epcot_map |> str_extract("[0-9]+")

## [1] "1"  "9"  "18" "20"

If the data was given as a tibble, then mutate can be used with str_extract to break off this information.

epcot_map_tibble <- 
  tibble(epcot_map) |>
  mutate(number = str_extract(epcot_map, "[0-9]+"))
epcot_map_tibble |> kable() |> kable_styling()

epcot_map	number
Spaceship Earth1	1
Creations Shop9	9
Living with the Land18	18
Turtle Talk With Crush20	20

12.8.1 Negation

There is still the problem of how to extract the information other than the digits. Here, the negation operator inside brackets can be used.

That is where things get complicated. Earlier, it was shown that ^ in a regex makes what comes next match only at the beginning of the string. However, when ^ appears as the first symbol inside a bracketed expression, it matches anything except what is listed in the brackets. So [^0-9] matches any symbol except the digits 0 through 9. This can be used to pull out the rest of the information. The select function can then be used to remove the original data variable.

epcot_map_tibble <- 
  tibble(epcot_map) |>
  mutate(name   = str_extract(epcot_map, "[^0-9]+")) |>
  mutate(number = str_extract(epcot_map, "[0-9]+")) |>
  select(-epcot_map)

epcot_map_tibble |> kable() |> kable_styling()

name	number
Spaceship Earth	1
Creations Shop	9
Living with the Land	18
Turtle Talk With Crush	20
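The two meanings of ^ are easy to confuse, so a side-by-side comparison may help. The sketch below assumes stringr is installed and uses a made-up two-element vector; it shows ^ as an anchor, as a negation, and as both at once.

```r
library(stringr)

x <- c("7up", "up7")

# ^ outside brackets is an anchor: the string must start with a digit.
starts_with_digit <- str_detect(x, "^[0-9]")
print(starts_with_digit)

# ^ inside brackets is a negation: the string must contain some non-digit.
has_non_digit <- str_detect(x, "[^0-9]")
print(has_non_digit)

# Both together: the string must start with a non-digit.
starts_with_non_digit <- str_detect(x, "^[^0-9]")
print(starts_with_non_digit)
```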

The + sign is a variant of what is called the Kleene star. Stephen Kleene was a mathematician and early theoretical computer scientist who wanted a way to extend the abilities of regular expressions.

The Kleene star, *, in a regular expression indicates that the pattern that appears previously occurs 0, 1, 2, or more times in a row.
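The difference between * and + shows up only when the repeated part is missing entirely. As a small sketch (assuming stringr is installed, with made-up strings), compare the two operators applied to the letter b.

```r
library(stringr)

x <- c("ac", "abc", "abbbc")

# b* allows zero or more copies of b, so "ac" matches.
star <- str_detect(x, "ab*c")
print(star)

# b+ requires at least one b, so "ac" does not match.
plus <- str_detect(x, "ab+c")
print(plus)
```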

12.8.2 The zero-one repeating variant

In addition to the + variant, the Kleene star has another variant ? which looks for the repetition of the pattern either zero or one time.
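A typical use of ? is to make a single symbol optional. The following sketch (assuming stringr is installed, with made-up strings) matches both the American and British spellings of a word, but not a string where the optional letter appears twice.

```r
library(stringr)

x <- c("color", "colour", "colouur")

# u? allows zero or one copy of the letter u between "colo" and "r".
optional_u <- str_detect(x, "colou?r")
print(optional_u)
```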

12.9 Repeating a pattern a specific number of times

Sometimes (as with phone numbers), a very specific number of digits needs to appear. In the US, a phone number (including area code) has 10 digits. Using [0-9]{10} will match exactly ten consecutive digits.

phone_numbers <-
  c("9095553472",
    "909555",
    "909 555 3472",
    "(909)555 3472",
    "909-555-3472")

phone_numbers |> str_extract("[0-9]{10}")

## [1] "9095553472" NA           NA           NA           NA

Unfortunately, putting in extra spaces, parentheses, or hyphens will mess up the extraction. Here the ? operator can be used to deal with this.

phone_numbers |> str_extract("[(]?[0-9]{3}[) -]?[0-9]{3}[ -]?[0-9]{4}")

## [1] "9095553472"    NA              "909 555 3472"  "(909)555 3472"
## [5] "909-555-3472"
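An alternative that combines tools from earlier sections is to first strip out everything that is not a digit, and then test for exactly ten digits. This is only a sketch, assuming stringr is installed. It is looser than the pattern above, since it throws away any non-digit symbol (including letters) before checking.

```r
library(stringr)

phone_numbers <-
  c("9095553472",
    "909555",
    "909 555 3472",
    "(909)555 3472",
    "909-555-3472")

# Remove every symbol that is not a digit.
digits_only <- str_replace_all(phone_numbers, "[^0-9]", "")
print(digits_only)

# The anchors force the whole string to consist of exactly ten digits.
is_ten_digits <- str_detect(digits_only, "^[0-9]{10}$")
print(is_ten_digits)
```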

Questions

Consider the regular expression 34.[a-g]. Which of the following strings does this pattern match?

"345b"
"34b5"
"435b"
"34gb"

Write down a regular expression that matches a digit, followed by any character, followed by the letter a, followed by any lowercase letter.

Consider the mtcars dataset. The rownames_to_column function can be used to make the row names the first column of the data.

mtcars |>
  rownames_to_column("Model_name") |>
  head()

##          Model_name  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Using str_detect and filter, find all the observations with car names that start with "Merc".
Using str_detect and filter, find all the observations with car names that have at least one digit in them.

Consider the USArrests dataset.

Use str_detect and filter to find all the observations from states that begin with the letter A.
Use str_detect and filter to find all the observations from states that end with the letter a.
Use str_detect and filter to find all the observations from states whose name begins with A and ends with a.

Consider the following code.

test <- c("Uranium Fever", "New Moon", "Ain't that a kick in the head")
str_extract(test, "[a-zA-Z]+$")

## [1] "Fever" "Moon"  "head"

str_extract(test, "^[a-zA-Z]+")

## [1] "Uranium" "New"     "Ain"

str_extract(test, "[^a-zA-Z]+")

## [1] " " " " "'"

Describe in words what the regular expression [a-zA-Z]+$ is trying to find.
Describe in words what the regular expression ^[a-zA-Z]+ is trying to find.
Describe in words what the regular expression [^a-zA-Z]+ is trying to find.

Consider the dataset painters from the MASS library. First, load in the library.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

Then the dataset looks as follows.

painters |> head()

##               Composition Drawing Colour Expression School
## Da Udine               10       8     16          3      A
## Da Vinci               15      16      4         14      A
## Del Piombo              8      13     16          7      A
## Del Sarto              12      16      9          8      A
## Fr. Penni               0      15      8          0      A
## Guilio Romano          15      16      4         14      A

Find the observations where the painter name has at least one space in it.
Using your tibble from part a., create a column after_space which lists the part of the painter's name after the last space in the name.

Suppose you have a vector of strings in s.

Write code to detect for each string in the vector if it contains the letter a followed by a digit at least once.
Write code that matches the pattern given in part a, but instead of returning true or false, returns the two characters that match the pattern if it exists in the string.

Consider a tibble:

phone_numbers <-
  tribble(
  ~Name, ~Phone,
  "Mr. Plow", "636-555-3226",
  "Big Mean Carla", "323-555-0129",
  "Elmo", "212-555-6666",
  "Phoenix Biogenics", "225-330-7040"
)

The area code for a U.S. phone number consists of the first three digits.

Write code to add a column area_code to the tibble which consists of the first three digits of the phone number.
Write code to filter phone_numbers to only include numbers where the middle three digits are 555.

Give a regular expression that matches one or more digits at the beginning of a string.
Give a regular expression that matches zero or more digits at the end of a string.

Consider the following tibble.

data <- 
  tibble(
    s = c("blue141car", "red314159truck", "yellow2718airplane")
  )

Using the separate command, use a regular expression for the sep parameter to separate the s variable into before_number and after_number.