12 Strings and regular expressions
Summary
Regular expressions (aka regex) are used to search for patterns of symbols within a list of symbols (a string).
str_detect returns TRUE if the pattern appears in the text, and FALSE if it does not.
An anchor in a regex limits the pattern matching to occur either at the beginning or end of a string.
The ^ anchor changes the pattern so that it only matches at the beginning of the string.
The $ anchor changes the pattern so that it only matches at the end of the string.
The . symbol is a wildcard. When it appears in a regex, it matches any single symbol in a string.
Bracket notation is a way to create a pattern that matches any one of several symbols in a string. Hyphens can designate a range of possibilities; for instance, [a-d] matches either a, b, c, or d.
str_extract returns the first part of the string that matches the regex pattern if it appears in the string, and NA if it does not.
str_extract_all returns every part of the string that matches the regex pattern, and an empty character vector, character(0), if there is no match.
Inside of brackets, a leading ^ is a negation operator, matching any symbol that does not match the pattern in the brackets. For instance, [1-5] matches the digits 1 through 5, while [^1-5] matches any symbol that is not one of the digits 1 through 5.
The + operator changes the pattern to match the text in the string if it appears one or more times.
The * operator (called the Kleene star) changes the pattern to match the text in the string if it appears zero or more times.
The ? operator changes the pattern to match the text in the string if it appears zero or one time.
The {n} operator repeats the last part of the pattern exactly n times.
12.1 Dealing with textual data
Often, variables in datasets take on textual values. For instance, consider the following snippet of data from the Global Historical Climatology Network–Daily (GHCN-Daily) dataset.
ghcn1 <- tribble(
~key, ~lat, ~lon, ~un1, ~station, ~type, ~gsnnum,
"ACW00011604", 17.1167, -61.7833, 10.1, "ST JOHNS COOLIDGE FLD", NA, NA,
"ACW00011647", 17.1333, -61.7833, 19.2, "ST JOHNS", NA, NA,
"AE000041196", 25.3330, 55.5170, 34.0, "SHARJAH INTER. AIRP", "GSN", 41196,
"AF000040930", 35.3170, 69.0170, 3366.0, "NORTH-SALANG", "GSN", 40930,
"AG000060390", 36.7167, 3.2500, 24.0, "ALGER-DAR EL BEIDA", "GSN", 60390,
"AG000060590", 30.5667, 2.8667, 397.0, "EL-GOLEA", "GSN", 60590,
"AG000060611", 28.0500, 9.6331, 561.0, "IN-AMENAS", "GSN", 60611,
"AG000060680", 22.8000, 5.4331, 1362.0, "TAMANRASSET", "GSN", 60680
)
Note that some of the stations are given the type value of GSN. In GHCN-Daily, this flag marks stations that belong to the GCOS Surface Network, a global set of surface weather stations selected under the Global Climate Observing System (GCOS).
To search for only those observations with type equal to GSN, the filter function could be used.
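A minimal sketch of such a call, assuming the dplyr and tibble packages are available. The name ghcn_small and the choice to keep only three columns are illustrative assumptions, made so that the example is short and self-contained.

```r
library(dplyr)
library(tibble)

# Trimmed, illustrative copy of the station table: only three columns are kept.
ghcn_small <- tribble(
  ~key,          ~station,                ~type,
  "ACW00011604", "ST JOHNS COOLIDGE FLD", NA,
  "ACW00011647", "ST JOHNS",              NA,
  "AE000041196", "SHARJAH INTER. AIRP",   "GSN",
  "AF000040930", "NORTH-SALANG",          "GSN",
  "AG000060390", "ALGER-DAR EL BEIDA",    "GSN",
  "AG000060590", "EL-GOLEA",              "GSN",
  "AG000060611", "IN-AMENAS",             "GSN",
  "AG000060680", "TAMANRASSET",           "GSN"
)

# Keep only the rows whose type is exactly "GSN".
# Rows where type is NA are dropped, because NA == "GSN" is NA, not TRUE.
gsn_only <- ghcn_small |> filter(type == "GSN")
gsn_only
```

Applied to the full table, the same filter call should keep the six GSN stations.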
## # A tibble: 6 × 7
## key lat lon un1 station type gsnnum
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 AE000041196 25.3 55.5 34 SHARJAH INTER. A… GSN 41196
## 2 AF000040930 35.3 69.0 3366 NORTH-SALANG GSN 40930
## 3 AG000060390 36.7 3.25 24 ALGER-DAR EL BEI… GSN 60390
## 4 AG000060590 30.6 2.87 397 EL-GOLEA GSN 60590
## 5 AG000060611 28.0 9.63 561 IN-AMENAS GSN 60611
## 6 AG000060680 22.8 5.43 1362 TAMANRASSET GSN 60680
Now suppose that the goal is to simply find a string within the data, rather than have it be the exact string. For instance, there are two stations with ST JOHNS in the name. But only one will be exactly equal to ST JOHNS.
## # A tibble: 1 × 7
## key lat lon un1 station type gsnnum
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 ACW00011647 17.1 -61.8 19.2 ST JOHNS <NA> NA
To find all station names that contain ST JOHNS, a regular expression will be used.
A regular expression is an encoding of a pattern to be searched for within a string of text. These are also called regex for short.
12.2 The stringr package
Regular expressions are implemented in R using the stringr package, which is loaded as part of the tidyverse. This package contains several functions that allow the user to test data against patterns described by regular expressions very efficiently.
For instance, suppose that the goal is just to find any string that contains ST JOHNS. Then the str_detect function can be used with the input "ST JOHNS" to detect if these letters appear anywhere in the string. The format is slightly different than when using ==. The first argument to str_detect is the vector of string values, and the second argument is the regex.
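The difference between the two approaches can be sketched as follows, assuming dplyr, tibble, and stringr are available. The small tibble named stations is a hypothetical stand-in for the full table, not part of the original data.

```r
library(dplyr)
library(tibble)
library(stringr)

# Hypothetical three-row stand-in for the station data.
stations <- tibble(
  station = c("ST JOHNS COOLIDGE FLD", "ST JOHNS", "NORTH-SALANG")
)

# Exact comparison: only the station named exactly "ST JOHNS" is kept.
exact_match <- stations |> filter(station == "ST JOHNS")

# Pattern search: every station whose name contains "ST JOHNS" is kept.
# The vector of strings comes first, the regex second.
contains_match <- stations |> filter(str_detect(station, "ST JOHNS"))

exact_match
contains_match
```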
## # A tibble: 2 × 7
## key lat lon un1 station type gsnnum
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
## 1 ACW00011604 17.1 -61.8 10.1 ST JOHNS COOLIDG… <NA> NA
## 2 ACW00011647 17.1 -61.8 19.2 ST JOHNS <NA> NA
Recall that filter requires an argument that is a vector of TRUE/FALSE values, one for each observation. So what str_detect returns is exactly such a vector.
example_string1 <- c(
"ST JOHNS",
"ST JOHNS COOLIDGE FLD",
"VAST JOHNS",
"PITTSBURGH",
"SV JOHNS AREA 51",
"ST. JOHNS",
"ST JOHNS AND ST JOHNS"
)
example_string1 |> str_detect("ST JOHNS")
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE
The first three entries and the last entry are TRUE because the string "ST JOHNS" appears somewhere within the string. Note that the regular expression still finds "ST JOHNS" in "VAST JOHNS", since it does not care about word breaks. The last entry is true because it is looking for the pattern appearing at least once. The fact that it appears twice is immaterial.
12.3 Anchors
What if the goal was only to find the characters "ST JOHNS" at the beginning of the string? Then the regex could be modified by adding an anchor, which restricts the match to either the beginning or the end of the string. In particular, when the circumflex symbol ^ is placed at the start of the regex, the pattern will only match if it occurs at the start of the string.
## [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE
What if the goal was to find these characters at the end of the string? Then put the dollar sign symbol $ at the end of the regex.
## [1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
If the goal was to find it at the beginning and at the end of the string, use both the ^ and $ symbols.
## [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
This formulation of ^string$ is equivalent to searching if the text exactly equals the string in question.
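The three anchored searches in this section can be sketched as below, assuming stringr is loaded. The variable names starts, ends, and whole are introduced here only for illustration.

```r
library(stringr)

example_string1 <- c(
  "ST JOHNS", "ST JOHNS COOLIDGE FLD", "VAST JOHNS", "PITTSBURGH",
  "SV JOHNS AREA 51", "ST. JOHNS", "ST JOHNS AND ST JOHNS"
)

# The match must start at the first character of the string.
starts <- str_detect(example_string1, "^ST JOHNS")

# The match must finish at the last character of the string.
ends <- str_detect(example_string1, "ST JOHNS$")

# Both anchors at once: the whole string must be exactly "ST JOHNS".
whole <- str_detect(example_string1, "^ST JOHNS$")

# For a pattern made only of literal characters, this agrees with ==.
identical(whole, example_string1 == "ST JOHNS")
```

The equivalence with == only holds when the text between the anchors contains no special regex symbols such as the period.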
But wait a minute, doesn’t ST JOHNS AND ST JOHNS both begin and end with ST JOHNS? So why is it detected as FALSE?
The answer lies in what regular expressions look for. They are looking for a pattern that satisfies all the constraints laid out in the regex. That means that it is looking for a set of contiguous characters in the string with three properties.
The ^ means that the set of contiguous characters must occur at the beginning of the string.
The ST JOHNS means that the set of contiguous characters must be of length eight, and match the characters exactly.
The $ means that the set of contiguous characters must be at the end of the string.
For the regex to return TRUE, all three of these properties must hold for the same set of characters. You cannot use one part of the string to match some of the properties and another part of the string to match the other properties. All properties must be satisfied by the same set of characters within the string.
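If the intent really is "begins with ST JOHNS and also ends with ST JOHNS", one possible approach is to run two separate tests and combine them with &. This is a sketch assuming stringr is loaded; the names single_pattern and two_patterns are illustrative.

```r
library(stringr)

x <- "ST JOHNS AND ST JOHNS"

# One pattern: the same eight characters would have to be both the start
# and the end of the string, which is impossible for this longer string.
single_pattern <- str_detect(x, "^ST JOHNS$")

# Two patterns: each test may be satisfied by a different part of the string.
two_patterns <- str_detect(x, "^ST JOHNS") & str_detect(x, "ST JOHNS$")

single_pattern  # FALSE
two_patterns    # TRUE
```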
12.4 Wildcards
A wildcard character in a regex can match more than one possible symbol.
Wildcards can be very general; the most general matches any character. In R, the wildcard for matching any single symbol is the period, written as .
Recall that the fifth entry in example_string1 is SV JOHNS AREA 51, and so does not match ST JOHNS.
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE
Now suppose that the user does not care about the second character. By replacing it with the wildcard symbol ., it can also match the V in SV JOHNS.
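A sketch of the wildcard search, assuming stringr is loaded; the variable name wild is illustrative.

```r
library(stringr)

example_string1 <- c(
  "ST JOHNS", "ST JOHNS COOLIDGE FLD", "VAST JOHNS", "PITTSBURGH",
  "SV JOHNS AREA 51", "ST. JOHNS", "ST JOHNS AND ST JOHNS"
)

# "S", then any single character, then " JOHNS".
wild <- str_detect(example_string1, "S. JOHNS")
wild
```

The sixth entry, ST. JOHNS, still does not match: the period stands for exactly one character, and here two characters (T and the literal period) sit between the S and the space.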
## [1] TRUE TRUE TRUE FALSE TRUE FALSE TRUE
12.5 Brackets
What if the goal is to see if the string contains any digits 0, 1, 2, up to 9? Then a pattern for digits can be formed by using the bracket construction [0-9]. This will accept any of the values 0 through 9.
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
Here SV JOHNS AREA 51 is the only one with a digit.
This type of pattern can also be used with letters. The following finds any string that uses the letters A through C.
## [1] FALSE TRUE TRUE TRUE TRUE FALSE TRUE
Note this captures PITTSBURGH because it has a B in it. Here it is important to specify uppercase or lowercase letters. None of the strings have any lowercase letters, so searching for [a-c] returns all FALSE values.
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
What if the goal was to match either a digit 0 through 9, or the letter B? Then both could be put into the bracket.
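The bracket patterns discussed in this section can be sketched together as follows, assuming stringr is loaded. The variable names are introduced here only for illustration.

```r
library(stringr)

example_string1 <- c(
  "ST JOHNS", "ST JOHNS COOLIDGE FLD", "VAST JOHNS", "PITTSBURGH",
  "SV JOHNS AREA 51", "ST. JOHNS", "ST JOHNS AND ST JOHNS"
)

# Any digit 0 through 9.
has_digit <- str_detect(example_string1, "[0-9]")

# Any uppercase letter A through C.
has_upper_a_to_c <- str_detect(example_string1, "[A-C]")

# Any lowercase letter a through c (none of the strings has one).
has_lower_a_to_c <- str_detect(example_string1, "[a-c]")

# Either a digit or the uppercase letter B.
has_digit_or_b <- str_detect(example_string1, "[0-9B]")
has_digit_or_b
```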
## [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
12.6 String extraction
So far str_detect has been used to generate just TRUE or FALSE. This is enough to use with filter, but sometimes the match itself is needed. This is the job of str_extract, which returns the text of the first match in each string, or NA if there is no match.
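Two extraction calls that illustrate this behavior are sketched below, assuming stringr is loaded; the variable names are illustrative.

```r
library(stringr)

example_string1 <- c(
  "ST JOHNS", "ST JOHNS COOLIDGE FLD", "VAST JOHNS", "PITTSBURGH",
  "SV JOHNS AREA 51", "ST. JOHNS", "ST JOHNS AND ST JOHNS"
)

# First digit or uppercase B in each string, NA where there is none.
first_digit_or_b <- str_extract(example_string1, "[0-9B]")
first_digit_or_b

# First occurrence of the literal text "ST JOHNS" in each string.
first_st_johns <- str_extract(example_string1, "ST JOHNS")
first_st_johns
```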
## [1] NA NA NA "B" "5" NA NA
## [1] "ST JOHNS" "ST JOHNS" "ST JOHNS" NA NA NA "ST JOHNS"
Note that even though the last string ST JOHNS AND ST JOHNS contains the pattern twice, it only shows up once in the output. This is because str_extract stops searching a string as soon as it finds the first match. To find all the occurrences of a pattern in a string, use str_extract_all.
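A sketch of the call, assuming stringr is loaded. The names all_matches and match_counts are illustrative; lengths is a base R function that reports the size of each list element.

```r
library(stringr)

example_string1 <- c(
  "ST JOHNS", "ST JOHNS COOLIDGE FLD", "VAST JOHNS", "PITTSBURGH",
  "SV JOHNS AREA 51", "ST. JOHNS", "ST JOHNS AND ST JOHNS"
)

# One list element per input string, holding every match in that string.
all_matches <- str_extract_all(example_string1, "ST JOHNS")

# Number of matches found in each string.
match_counts <- lengths(all_matches)

all_matches
```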
## [[1]]
## [1] "ST JOHNS"
##
## [[2]]
## [1] "ST JOHNS"
##
## [[3]]
## [1] "ST JOHNS"
##
## [[4]]
## character(0)
##
## [[5]]
## character(0)
##
## [[6]]
## character(0)
##
## [[7]]
## [1] "ST JOHNS" "ST JOHNS"
The output is a list with one element per input string. Each element is a character vector holding every match found in that string, or an empty vector, character(0), if there was none.
12.7 Replacing parts of a string
Earlier, it was seen that it is sometimes necessary to replace the part of a string matched by a pattern with another string. The str_replace function can be used to do this. For instance, to conserve vowels, each JOHNS that is found can be replaced by JHNS. This can be done in the example strings as follows.
## [1] "ST JHNS" "ST JHNS COOLIDGE FLD" "VAST JHNS"
## [4] "PITTSBURGH" "SV JHNS AREA 51" "ST. JHNS"
## [7] "ST JHNS AND ST JOHNS"
In strings where the pattern is not found (like PITTSBURGH), no replacement was done. However, in the last entry only the first JOHNS was changed, and not the second. Once again, adding _all to the command gives the function str_replace_all, which replaces every occurrence of the pattern.
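Both replacement functions can be compared side by side in a short sketch, assuming stringr is loaded; the names first_only and every_one are illustrative.

```r
library(stringr)

example_string1 <- c(
  "ST JOHNS", "ST JOHNS COOLIDGE FLD", "VAST JOHNS", "PITTSBURGH",
  "SV JOHNS AREA 51", "ST. JOHNS", "ST JOHNS AND ST JOHNS"
)

# Replace only the first JOHNS found in each string.
first_only <- str_replace(example_string1, "JOHNS", "JHNS")

# Replace every JOHNS found in each string.
every_one <- str_replace_all(example_string1, "JOHNS", "JHNS")

# The two results differ only in the last entry, which has two matches.
first_only[7]
every_one[7]
every_one
```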
## [1] "ST JHNS" "ST JHNS COOLIDGE FLD" "VAST JHNS"
## [4] "PITTSBURGH" "SV JHNS AREA 51" "ST. JHNS"
## [7] "ST JHNS AND ST JHNS"
Now every occurrence has changed.
12.8 Repeating patterns
If the goal is just to detect one or more digits in a string, then str_detect(string, "[0-9]") works fine. However, when str_extract is being used, often the goal is to match a consecutive run of digits. Repetition operators such as + and the Kleene star can be used to do this.
Consider the following table, drawn from the map of Epcot at Walt Disney World for May 2022.
epcot_map <- c("Spaceship Earth1",
               "Creations Shop9",
               "Living with the Land18",
               "Turtle Talk With Crush20")
Unfortunately, there is no comma separating the first part of the name from the number attached to the key at the end. To make matters more difficult, the number of digits at the end is different for different observations.
Putting a + after a bracketed expression makes it match one or more copies of it. To see the difference, first consider a regex without the +.
## [1] "1" "9" "1" "2"
Now with the + it can match one or more digits.
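The two extractions can be sketched as follows, assuming stringr is loaded; the names one_digit and all_digits are illustrative.

```r
library(stringr)

epcot_map <- c("Spaceship Earth1",
               "Creations Shop9",
               "Living with the Land18",
               "Turtle Talk With Crush20")

# Without +, the bracket matches a single digit, so only the first digit is returned.
one_digit <- str_extract(epcot_map, "[0-9]")

# With +, the match extends over the whole run of consecutive digits.
all_digits <- str_extract(epcot_map, "[0-9]+")

one_digit
all_digits
```

Note that str_extract returns character strings, so "18" and "20" would still need as.numeric or a similar conversion before being used as numbers.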
## [1] "1" "9" "18" "20"
If the data was given as a tibble, then mutate can be used with str_extract to break off this information.
epcot_map_tibble <-
tibble(epcot_map) |>
mutate(number = str_extract(epcot_map, "[0-9]+"))
epcot_map_tibble |> kable() |> kable_styling()

| epcot_map | number |
|---|---|
| Spaceship Earth1 | 1 |
| Creations Shop9 | 9 |
| Living with the Land18 | 18 |
| Turtle Talk With Crush20 | 20 |
12.8.1 Negation
There is still the problem of how to extract the information other than the digits. Here, the negation operator inside brackets can be used.
That is where things get complicated. Earlier, it was shown that ^ in a regex makes what comes next only match at the beginning of the string. However, when ^ appears as the first symbol inside a bracketed expression, the brackets match any symbol except those listed. So [^0-9] matches anything except the digits 0 through 9. This can be used to pull out the rest of the information. The select function can then be used to remove the original data variable.
epcot_map_tibble <-
tibble(epcot_map) |>
mutate(name = str_extract(epcot_map, "[^0-9]+")) |>
mutate(number = str_extract(epcot_map, "[0-9]+")) |>
select(-epcot_map)
epcot_map_tibble |> kable() |> kable_styling()

| name | number |
|---|---|
| Spaceship Earth | 1 |
| Creations Shop | 9 |
| Living with the Land | 18 |
| Turtle Talk With Crush | 20 |
The + sign is a variant of what is called the Kleene star. Stephen Kleene was a mathematician and early theoretical computer scientist who wanted a way to extend the abilities of regular expressions.
The Kleene star, *, in a regular expression indicates that the preceding part of the pattern occurs zero or more times in a row.
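The difference between the repetition operators is easiest to see on a small made-up example. The sketch below assumes stringr is loaded; the vector words and the variable names are hypothetical and not part of the chapter's data.

```r
library(stringr)

# Hypothetical test strings with zero, one, and two copies of the letter a.
words <- c("ct", "cat", "caat")

# * allows zero or more copies of a between c and t.
star <- str_detect(words, "ca*t")

# + requires at least one copy of a.
plus <- str_detect(words, "ca+t")

# ? allows at most one copy of a.
question <- str_detect(words, "ca?t")

star      # TRUE  TRUE  TRUE
plus      # FALSE TRUE  TRUE
question  # TRUE  TRUE  FALSE
```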
12.9 Repeating a pattern a specific number of times
Sometimes (as with phone numbers), a very specific number of digits needs to appear. In the US, a phone number (including area code) has 10 digits. Using [0-9]{10} will match exactly ten digits in a row.
phone_numbers <-
c("9095553472",
"909555",
"909 555 3472",
"(909)555 3472",
"909-555-3472")
phone_numbers |> str_extract("[0-9]{10}")
## [1] "9095553472" NA NA NA NA
Unfortunately, putting in extra spaces, parentheses, or hyphens will mess up the extraction. Here the ? operator can be used to deal with this.
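One pattern that handles these variations is sketched below, assuming stringr is loaded. The pattern and the name phone_pattern are assumptions made for illustration, not necessarily the pattern originally used. The pattern also relies on \\( and \\), a backslash escape for literal parentheses that this chapter has not introduced.

```r
library(stringr)

phone_numbers <-
  c("9095553472",
    "909555",
    "909 555 3472",
    "(909)555 3472",
    "909-555-3472")

# Illustrative pattern (an assumption, not the chapter's own):
#   \\(?      an optional opening parenthesis
#   [0-9]{3}  three digits
#   \\)?      an optional closing parenthesis
#   [ -]?     an optional space or hyphen
#   [0-9]{3}  three digits
#   [ -]?     an optional space or hyphen
#   [0-9]{4}  four digits
phone_pattern <- "\\(?[0-9]{3}\\)?[ -]?[0-9]{3}[ -]?[0-9]{4}"

extracted <- str_extract(phone_numbers, phone_pattern)
extracted
```

Each ? makes only the single item before it optional, so the digits themselves are still required. That is why "909555" remains NA.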
## [1] "9095553472" NA "909 555 3472" "(909)555 3472"
## [5] "909-555-3472"
Questions
Consider the regular expression 34.[a-g]. Which of the following strings does this pattern match?
"345b"
"34b5"
"435b"
"34gb"
Write down a regular expression that matches a digit, followed by any character, followed by the letter a, followed by any lowercase letter.
Consider the mtcars dataset. The rownames_to_column function from the tibble package can be used to make the row names the first column of data.
## Model_name mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Using str_detect and filter, find all the observations with car names that start with "Merc".
Using str_detect and filter, find all the observations with car names that have at least one digit in them.
Consider the USArrests dataset.
Use str_detect and filter to find all the observations from states that begin with the letter A.
Use str_detect and filter to find all the observations from states that end with the letter a.
Use str_detect and filter to find all the observations from states that begin with A and end with a.
Consider the following code.
test <- c("Uranium Fever", "New Moon", "Ain't that a kick in the head")
str_extract(test, "[a-zA-Z]+$")
## [1] "Fever" "Moon" "head"
## [1] "Uranium" "New" "Ain"
## [1] " " " " "'"
Describe in words what the regular expression [a-zA-Z]+$ is trying to find.
Describe in words what the regular expression ^[a-zA-Z]+ is trying to find.
Describe in words what the regular expression [^a-zA-Z]+ is trying to find.
Consider the dataset painters from the MASS library. First, load in the library.
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
Then the dataset looks as follows.
## Composition Drawing Colour Expression School
## Da Udine 10 8 16 3 A
## Da Vinci 15 16 4 14 A
## Del Piombo 8 13 16 7 A
## Del Sarto 12 16 9 8 A
## Fr. Penni 0 15 8 0 A
## Guilio Romano 15 16 4 14 A
Find the observations where the painter name has at least one space in it.
Using your tibble from part a., create a column after_space which lists the part of the painter's name after the last space in the name.
Suppose you have a vector of strings in s.
Write code to detect for each string in the vector if it contains the letter a followed by a digit at least once.
Write code that matches the pattern given in part a, but instead of returning true or false, returns the two characters that match the pattern if it exists in the string.
Consider a tibble:
phone_numbers <-
tribble(
~Name, ~Phone,
"Mr. Plow", "636-555-3226",
"Big Mean Carla", "323-555-0129",
"Elmo", "212-555-6666",
"Phoenix Biogenics", "225-330-7040"
)
The area code for a U.S. phone number consists of the first three digits.
Write code to add a column area_code to the tibble which consists of the first three digits of the phone number.
Write code to filter phone_numbers to only include numbers where the middle three digits are 555.
Give a regular expression that matches one or more digits at the beginning of a string.
Give a regular expression that matches zero or more digits at the end of a string.