13 Backslashes
Summary
Escape characters are combinations of characters that match a specific symbol. The escape characters in regex in R all consist of a backslash \ followed by a single character. Examples include \( for a left parenthesis, \. for a period, and \\ for a backslash.
To create a single backslash inside a string in R, you must use two backslashes inside the string. So the string "\\." gets transformed to the regex \. when it is interpreted.
The regex function can be used to explicitly turn a string into a regex, although R applies this function automatically in most places it is needed.
Searching a string for a pattern contained in a regex takes time linear in the length of the string.
Another way to present a pattern is to use a glob. The glob2rx function turns a pattern given by a glob into a pattern given by a regex.
13.1 Searching for special characters
The . is a wildcard character that matches any symbol in a string.
## [,1]
## [1,] "AA"
## [2,] "AB"
## [3,] "AC"
## [4,] NA
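The call that produced this output is not shown here. As a minimal sketch, assuming the stringr package and a hypothetical vector `x` (both the strings and the pattern are illustrative choices, not the original input), the wildcard behaves as follows:

```r
library(stringr)

# Hypothetical input: three strings that start with "A" and one that does not
x <- c("AA", "AB", "AC", "BD")

# The regex A. means: the letter A followed by any single character
wild <- str_match(x, "A.")
wild
```

The last string contains no "A", so its row in the result is NA.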
Of course, this raises a question: how would you search for an actual period . inside a string? How would you match the string ST. LOUIS exactly?
13.2 Using the backslash
In order to match ., or ^, or $, or any of the other special characters in regular expressions, it is necessary to use escape characters.
An escape character is a short sequence of characters that stands for a single symbol, with an extra indicator character marking the sequence as special. Usually they are of the form \<character>, where \ is the backslash.
For instance, inside of a regex, \. matches against an actual period ".", rather than acting like a wildcard.
13.3 The difference between a string and a regex
Okay, so if the regex ST\. LOUIS were to be created, then it would match the string "ST. LOUIS". But how exactly is a regex created? Here it is important to talk a bit about how regex are constructed in R (and many other languages).
There is an explicit function regex that converts a string to a regex.
## [1] "test"
## attr(,"options")
## attr(,"options")$case_insensitive
## [1] FALSE
##
## attr(,"options")$comments
## [1] FALSE
##
## attr(,"options")$dotall
## [1] FALSE
##
## attr(,"options")$multiline
## [1] FALSE
##
## attr(,"class")
## [1] "stringr_regex" "stringr_pattern" "character"
Note that the final output gives the class as stringr_regex, stringr_pattern, and character. So this is definitely not a simple string!
Here is the thing: in order to keep things simple, R (through the stringr functions) will normally apply the regex function automatically when it is needed. Consider the following.
## [,1]
## [1,] "ing"
The function str_match needs a pattern (a regex) as an argument, but it was given a string. So the regex function is applied automatically to the string in order to get a pattern. Applying regex explicitly gives the same result.
## [,1]
## [1,] "ing"
Exactly the same!
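A minimal sketch of this comparison, assuming the stringr package and a hypothetical input string "testing" (the original input is not shown here):

```r
library(stringr)

# Explicitly convert a string into a regex pattern object
p <- regex("ing")
class(p)  # a pattern class on top of "character"

# Pass the plain string, then the explicit regex, to str_match
from_string <- str_match("testing", "ing")
from_regex <- str_match("testing", p)

# Both calls return the same matrix
identical(from_string, from_regex)
```

The explicit form is mainly useful when options such as `ignore_case = TRUE` need to be passed to `regex`.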
13.4 Creating a backslash inside of strings
Why is this important if R does it automatically? Well, it helps explain what comes next!
Suppose the following command is typed into R.
## Error: '\.' is an unrecognized escape in character string (<input>:1:4)
This gives an error. R tells the user that \. is an unrecognized escape character. You cannot put this escape character into a string! So how can the user get it into a regex?
What is needed is a way to just pass a backslash as a string. The answer is to use the escape character for backslash, which is \\!
## [1] "a\\.b"
Now the string holds the escape character for backslash. It will be interpreted correctly when the regex function is applied.
## [,1]
## [1,] "a.b"
## [2,] NA
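The two layers (string, then regex) can be inspected directly. The sketch below assumes the stringr package and two hypothetical test strings, "a.b" and "acb", chosen to show the difference between an escaped and an unescaped period:

```r
library(stringr)

# Two backslashes in the source code produce one backslash in the string
pattern <- "a\\.b"
nchar(pattern)       # 4 characters: a, \, ., b
writeLines(pattern)  # prints the text the regex engine receives: a\.b

# Escaped period: only a literal "." matches
escaped <- str_match(c("a.b", "acb"), pattern)

# Unescaped period: the wildcard matches "." and "c" alike
unescaped <- str_match(c("a.b", "acb"), "a.b")
```

Using `writeLines` rather than `print` is a convenient way to see the regex itself, because `print` shows the string with its escapes.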
13.5 Escape characters in regex
The characters . ( ) [ ] ^ $ * + ? | \ all have special meanings within a regex. So they all have escape characters, \., \(, \), \[, \], \^, \$, \*, \+, \?, \|, and \\.
And remember, when writing these within strings, every backslash \ in the regex is replaced by two backslashes "\\" in the string. So \$ in the regex becomes "\\$" in the string.
This leads to an interesting conclusion: to get the regex pattern to match a backslash, the regex is \\, making the string "\\\\". Wow!
## [1] "\\" "\\." "abc"
## [,1]
## [1,] "\\"
## [2,] "\\"
## [3,] NA
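A short sketch of the four-backslash case, assuming the stringr package. The vector `x` mirrors the three strings printed above; `fixed` is a stringr helper that treats the pattern as literal text, shown here as an alternative rather than as the method used in this chapter.

```r
library(stringr)

# In source code each pair of backslashes is one backslash in the string
x <- c("\\", "\\.", "abc")
writeLines(x)  # shows the actual contents: \  then \.  then abc
nchar(x)       # 1 2 3

# Regex \\ (written "\\\\" in a string) matches one literal backslash
has_backslash <- str_detect(x, "\\\\")

# Alternative: skip regex interpretation entirely with fixed()
has_backslash_fixed <- str_detect(x, fixed("\\"))
```

Both calls flag the first two strings and reject "abc".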
13.6 Back references
When the wildcard character . is used, it matches any symbol. But which symbol exactly did it match? In standard regex, this information is gone for good once the wildcard is used.
The idea of a back reference (also sometimes written without the space as backreference) allows the text that matched part of a pattern to be used again later in the same pattern.
For instance, suppose a string has the form of a first string, then a space, then a second string.
In two of these strings, the first string is exactly the same as the second string. Now suppose the goal is to check if the first string is the same as the second string. This could be accomplished using our existing tools as follows.
First consider how to pull out the first string. In the following regex, the first ^ is an anchor to match the pattern to the beginning of the string, while the ^ inside the brackets is a negation to accept any number of symbols that are not the space character.
## [,1]
## [1,] "abc"
## [2,] "ab"
## [3,] "ca"
Similarly, the $ can be used to anchor the pattern to the end of the strings.
## [,1]
## [1,] "abc"
## [2,] "ab"
## [3,] "ad"
Finally, simple comparison can check if the two strings are the same.
## [,1]
## [1,] TRUE
## [2,] TRUE
## [3,] FALSE
It becomes a little trickier if the goal is to return the value of the strings if they are the same, and NA otherwise.
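The steps so far can be sketched as follows, assuming the stringr package. The vector `x` is a hypothetical input chosen to be consistent with the outputs shown above; the original strings are not displayed in this section.

```r
library(stringr)

# Hypothetical input: two words separated by a single space
x <- c("abc abc", "ab ab", "ca ad")

# Everything before the space, anchored at the start of the string
first <- str_match(x, "^[^ ]+")[, 1]

# Everything after the space, anchored at the end of the string
second <- str_match(x, "[^ ]+$")[, 1]

# Compare the two pieces, then keep the value only where they agree
same <- first == second
result <- ifelse(same, first, NA)
result
```

Four separate steps are needed here, which is the complication that back references remove.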
So our code is becoming complicated and (more) difficult to understand. Is there a way to do better? The answer is to use a back reference.
In a pattern, each subpattern enclosed in parentheses is assigned a number. That number can be used later in the pattern to recall the characters in the searched string that matched the subpattern. This is called a backreference.
To illustrate, consider putting the search before the space in parentheses.
## [,1] [,2]
## [1,] "abc" "abc"
## [2,] "ab" "ab"
## [3,] "ca" "ca"
Because the ([^ ]+) is now inside parentheses, it gets assigned a number. Since it was the first set of parentheses used, the number is 1. At this point, using \1 gives the same set of characters that actually match the wildcard. So for string "abc abc", \1 is equal to "abc". For string "ab ab", \1 is equal to "ab". This can then be used to search in the rest of the string.
## [,1] [,2]
## [1,] "abc abc" "abc"
## [2,] "ab ab" "ab"
## [3,] NA NA
In words, the regex ^([^ ]+) \1 means that starting at the beginning of the search string, look for characters until a space is found. Record the characters that were searched until the space is found. Then look for a space. Then search to see if the exact same characters were found in the string again.
Note that the characters in the backreference were returned in the second column. If that string was the goal, then the final code would look as follows.
## [1] "abc" "ab" NA
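A minimal sketch of the back reference version, assuming the stringr package and the same hypothetical vector `x` used for illustration (the original input is not shown). Note that the regex \1 must be written "\\1" inside an R string.

```r
library(stringr)

# Hypothetical input: two words separated by a single space
x <- c("abc abc", "ab ab", "ca ad")

# Group 1 captures the first word; \1 demands the same text after the space
repeated <- str_match(x, "^([^ ]+) \\1")[, 2]
repeated

# Without an end anchor, a longer second word still matches
loose <- str_detect("ab abc", "^([^ ]+) \\1")

# Adding $ requires the second word to end where the repeat ends
strict <- str_detect("ab abc", "^([^ ]+) \\1$")
```

Whether the trailing $ is wanted depends on whether "ab abc" should count as a repeat; the pattern in the text leaves it off.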
13.7 The speed of regex
At this point it is fair to ask: why are regex so widely used? After all, they go against the basic principles of clear, readable code. Instead, a regex encodes a program in a way that is impenetrable to someone who has not studied their special characters.
There are two answers to this. First, the notation for regex was created at a time when computers were much slower than today. Input was introduced either by direct wiring, or a bit later through paper card input, or magnetic tape input. In all cases, keeping program sizes as small as possible was a feature, not a problem, and so the compactness of the notation was an advantage.
The second reason (and why regex are still used today) is that they are fast. Not just fast for the problem, but essentially as fast as it is possible to search a string. Computer scientists say that a regex solves the pattern matching problem in linear time: the time needed to check whether a pattern written as a regex occurs in a string is proportional to the number of characters in the string. Since every character in a string may have to be read in order to do the search, no method can do fundamentally better. (This guarantee applies to standard regex without back references; patterns that use back references can take longer.)
To see why, consider a search for the regex pattern a.b in the string "safcadbr". The computer starts with the first character in the regex, a, and compares it to the first character in the string "s". They are different. Therefore, the computer starts from the beginning of the regex, and moves to the next character of the string, making that the current string character.
Next the computer looks at the first character in the regex, a, and compares it to the current character of the string "a". They are the same! Hence the computer moves to the next character in the regex, and next character in the string.
That means the computer will compare the character in the regex . to the character in the string "f". They match (because the regex character . matches everything). Hence the computer moves to the next character of the string, and the next character of the regex.
Oh no, the next character of the regex is b, but the next character of the string is "c". They do not match. So the computer moves back to the beginning character of the regex, and next character of the string.
The whole search of the string by the regex is contained in the following table.
| Position regex | Position string | Char regex | Char String | Result |
|---|---|---|---|---|
| 1 | 1 | a | s | No match |
| 1 | 2 | a | a | Match |
| 2 | 3 | . | f | Match |
| 3 | 4 | b | c | No match |
| 1 | 5 | a | a | Match |
| 2 | 6 | . | d | Match |
| 3 | 7 | b | b | Match/End search successfully |
Note that as the string was searched for the pattern, no matter what happened, at the next step the next character in the string was examined. There is no going back, only moving forward. And that is why using a regex to search for a pattern in a string is so fast, and why they are still used today in data science.
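The walk-through in the table can be reproduced with a few lines of base R. This is only a sketch of the simplified scan described above, not how a real regex engine is implemented: the hypothetical helper `scan_trace` handles just literal characters and the wildcard ., and restarting the pattern at the next string character can miss matches in other inputs, which is why real engines keep track of more state.

```r
# Hypothetical helper: trace the simplified left-to-right scan from the text
scan_trace <- function(pattern, string) {
  p <- strsplit(pattern, "")[[1]]  # pattern characters
  s <- strsplit(string, "")[[1]]   # string characters
  i <- 1                           # current position in the pattern
  rows <- list()
  for (j in seq_along(s)) {        # each string character is read once
    ok <- p[i] == "." || p[i] == s[j]
    rows[[j]] <- data.frame(
      pos_regex = i, pos_string = j,
      char_regex = p[i], char_string = s[j], match = ok
    )
    if (ok && i == length(p)) break  # whole pattern matched
    i <- if (ok) i + 1 else 1        # advance in the pattern, or restart it
  }
  do.call(rbind, rows)
}

trace <- scan_trace("a.b", "safcadbr")
trace
```

The loop index `j` only ever increases, which is the "no going back" property: the number of steps can never exceed the number of characters in the string.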
13.8 Alternations
The vertical bar | is the alternation symbol in regular expressions, and acts like a logical OR. For instance, cat|dog detects either the string "cat" or the string "dog".
## [1] TRUE TRUE FALSE
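A minimal sketch of alternation, assuming the stringr package and a hypothetical vector of three animal names (the original input is not shown):

```r
library(stringr)

# Hypothetical input: two strings that should match and one that should not
animals <- c("cat", "dog", "bird")

# cat|dog matches a string containing either "cat" or "dog"
found <- str_detect(animals, "cat|dog")
found
```

Because `str_detect` looks for the pattern anywhere in the string, "catalog" would also be detected by this regex.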
13.9 Regular expressions and separate
For the separate command, the sep argument can be a regular expression. Consider the following data.
## # A tibble: 2 × 1
## name
## <chr>
## 1 Trial - by : fire
## 2 A - dozen : eggs
To break apart these entries into three words, the regular expression "( - )|( : )" could be used. The first parenthetical expression captures the first hyphen surrounded by spaces used to separate the first and second columns. The second parenthetical expression captures the colon surrounded by spaces that separates the second and third columns. The result is as follows.
tibble(
  name = c("Trial - by : fire", "A - dozen : eggs")
) |>
  separate(name, into = c("word1", "word2", "word3"), sep = "( - )|( : )")
## # A tibble: 2 × 3
## word1 word2 word3
## <chr> <chr> <chr>
## 1 Trial by fire
## 2 A dozen eggs
13.9.1 Example: tidying the U.S. Senate List
The U.S. Senate website keeps a list of current senators. The data set senate116.txt (retrieved from https://www.senate.gov/general/contact_information/senators_cfm.cfm on 2019-03-05) contains the name, party, state, class, address, phone, and contact URL for all 100 senators of the 116th Congress, but it is not tidy data. It can be, though!
First read in the data.
## Rows: 400 Columns: 1
## ── Column specification ───────────────────────────────────
## Delimiter: "\001"
## chr (1): X1
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A quick look at the data reveals that the column is a pattern of four repeating lines. In order, these contain what will be referred to as "stuff", "address", "phone", and "contact". To break these into separate columns, mutate can be used to add a column with the proper names, then pivot_wider can move the rows into the proper position. There are 100 senators, so this pattern repeats itself 100 times, and the rep command can be used to build the column of names. It is also useful to have an ID number for each senator, assigned in the order the data is presented, which is alphabetical.
senate_raw_data |>
  mutate(type = rep(c("stuff", "address", "phone", "contact"), 100)) |>
  mutate(id = floor((0:399) / 4) + 1, .before = 1)
## # A tibble: 400 × 3
## id X1 type
## <dbl> <chr> <chr>
## 1 1 Alexander, Lamar - (R - TN) Class II stuff
## 2 1 455 Dirksen Senate Office Building Washington … addr…
## 3 1 (202) 224-4944 phone
## 4 1 Contact: www.alexander.senate.gov/public/index… cont…
## 5 2 Baldwin, Tammy - (D - WI) Class I stuff
## 6 2 709 Hart Senate Office Building Washington DC … addr…
## 7 2 (202) 224-5653 phone
## 8 2 Contact: www.baldwin.senate.gov/feedback cont…
## 9 3 Barrasso, John - (R - WY) Class I stuff
## 10 3 307 Dirksen Senate Office Building Washington … addr…
## # ℹ 390 more rows
Now use pivot_wider to put the rows into place.
senate_raw_data |>
  mutate(type = rep(c("stuff", "address", "phone", "contact"), 100)) |>
  mutate(id = floor((0:399) / 4) + 1, .before = 1) |>
  pivot_wider(names_from = type, values_from = X1) -> senate_nearly_tidy_data
senate_nearly_tidy_data
## # A tibble: 100 × 5
## id stuff address phone contact
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Alexander, Lamar - (R - TN) Cl… 455 Di… (202… Contac…
## 2 2 Baldwin, Tammy - (D - WI) Clas… 709 Ha… (202… Contac…
## 3 3 Barrasso, John - (R - WY) Clas… 307 Di… (202… Contac…
## 4 4 Bennet, Michael F. - (D - CO) … 261 Ru… (202… Contac…
## 5 5 Blackburn, Marsha - (R - TN) C… 357 Di… (202… Contac…
## 6 6 Blumenthal, Richard - (D - CT)… 706 Ha… (202… Contac…
## 7 7 Blunt, Roy - (R - MO) Class III 260 Ru… (202… Contac…
## 8 8 Booker, Cory A. - (D - NJ) Cla… 717 Ha… (202… Contac…
## 9 9 Boozman, John - (R - AR) Class… 141 Ha… (202… Contac…
## 10 10 Braun, Mike - (R - IN) Class I B85 Ru… (202… Contac…
## # ℹ 90 more rows
Now it’s your turn! Finish tidying up this data with separate and unite: extract from the stuff variable the name, party, state, and class of the observation, and eliminate the Contact: before the web address in the contact variable. For instance, the first observation should be
## # A tibble: 1 × 7
## name party state class address phone contact
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Alexander, Lamar R TN II 455 Dirkse… (202… www.al…
A string like "Alexander, Lamar - (R - TN) Class II" can be broken apart with the following separator:
tibble(
  name = "Alexander, Lamar - (R - TN) Class II"
) |>
  separate(name,
    into = c("name", "party", "state", "class", "address"),
    sep = "( - \\()|( - )|(\\) Class )")
## Warning: Expected 5 pieces. Missing pieces filled with `NA` in 1
## rows [1].
## # A tibble: 1 × 5
## name party state class address
## <chr> <chr> <chr> <chr> <chr>
## 1 Alexander, Lamar R TN II <NA>
13.10 Pattern matching with globs
One common application of pattern matching is to sort through names of files. In R, the dir function can be used to get a list of the file names in a particular directory. Suppose that this command was run and the following list of files was returned.
Then to sort out the files ending in .txt, the following regex does the trick:
## [,1]
## [1,] "weather1.txt"
## [2,] "weather23.txt"
## [3,] NA
## [4,] NA
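The file list and the call are not shown in this section. As a sketch, assuming the stringr package and a hypothetical vector `files` assembled from the file names that appear in the outputs of this section, the search could look like this:

```r
library(stringr)

# Hypothetical directory listing, as dir() might return it
files <- c("weather1.txt", "weather23.txt", "weather23.csv", "weather47.html")

# .+ is one or more of any character, \. is a literal period, then txt
txt_match <- str_match(files, ".+\\.txt")
txt_match

# To keep only the matching file names, filter with str_detect
txt_files <- files[str_detect(files, "\\.txt$")]
```

Anchoring with $ in the filter is a design choice that avoids accepting a name such as "notes.txt.bak".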
Here the .txt part is called the file extension, and it is common to want to search for files with a particular file extension. For that reason, another form of pattern matching, called globbing, is used in many operating systems.
A glob is a pattern matching scheme that is often used with file directories. It is notable for its two wildcard characters. First, * matches any contiguous sequence of characters (including none). Second, ? matches any single character.
In the search above, the regex .+\.txt corresponds to *.txt in glob form. If you wish to use globs directly, the glob2rx function can be used to convert a glob expression to a regex.
## [,1]
## [1,] "weather1.txt"
## [2,] "weather23.txt"
## [3,] NA
## [4,] NA
The ? symbol matches any single character. So ??? after the period in a glob, as in *.???, will match any three-character file extension.
## [,1]
## [1,] "weather1.txt"
## [2,] "weather23.txt"
## [3,] "weather23.csv"
## [4,] NA
Similarly, .???? at the end will match any four-character file extension.
## [,1]
## [1,] NA
## [2,] NA
## [3,] NA
## [4,] "weather47.html"
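The three glob searches can be sketched together, assuming the stringr package, the base R function `glob2rx` (from the utils package), and the same hypothetical `files` vector built from the file names in the outputs above. The exact globs used for the outputs are not shown, so `*.???` and `*.????` are assumptions.

```r
library(stringr)

# Hypothetical directory listing
files <- c("weather1.txt", "weather23.txt", "weather23.csv", "weather47.html")

# glob2rx translates a glob into an anchored regex
txt_rx <- glob2rx("*.txt")
txt_rx  # "^.*\\.txt$"

# Any name ending in .txt
is_txt <- str_detect(files, txt_rx)

# Each ? stands for exactly one character
three <- str_detect(files, glob2rx("*.???"))   # three-character extension
four <- str_detect(files, glob2rx("*.????"))   # four-character extension
```

Note that `glob2rx` anchors the regex at both ends, so the glob must describe the whole file name, not just a piece of it.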
Questions
What string answer01 gives a regex that matches the pattern (Why?)(Because.) somewhere in a string? That is, regex(answer01) should match "(How?)(Why?)(Because.)" but not "(How?)(Why.)(Because?)".
Suppose that I have a regular expression \(\[[a-z]*\]\) that looks for any number of lowercase letters surrounded by brackets and then surrounded by parentheses. Assign to answer02 the string where regex(answer02) creates this regular expression.
Write a function answer03 that takes as input a vector of strings, and returns TRUE for each string that contains two digits followed by the same two digits. So "a57572b" should match, but "57725" should not.
Create a function answer04a that takes as input a vector of strings, and checks if each string is both nonempty and only contains digits. So "23443" and "2" should match, but "" and "2452abc" should not. For strings in the vector that match, it should return TRUE, and for strings that don’t, it should return FALSE.
Create a function answer04b that matches a left bracket, followed by any number (including zero) of digits, followed by a right bracket. So "[23443]" and "[]" should match, but "(23443)" and "[23gdg]" should not. Have your function extract the first pattern that matches if such exists in the string.
Using str_replace_all, write a function answer05 that takes a string s as input and replaces every instance of digits surrounded by parentheses with the same digits surrounded by the vertical bar |. So "(345)(874)[273]" gets turned into "|345||874|[273]".