Regular expressions

Regular expressions are a method for describing patterns in text strings. They can be useful for searching and replacing text and provide efficient ways to fix up messy data.

Let's start with an example. You read in some data that uses dollar signs and comma separators, e.g. $2,345,435.67. The function read.csv will interpret this as character data, and it is not straightforward to coerce to a numerical value because as.numeric will return NA.

x = "$2,345,435.67"
as.numeric(x) ## [1] NA

Of course it would be possible to manually edit the values to remove the unwanted $ signs and commas, but this is seldom practical. Regular expressions provide a way of automating this editing process. The logic is to search for “$” and “,” and replace them with “”, i.e. nothing. That gets rid of the parts that are not needed, leaving the numbers.

The easiest way to use regular expressions is with functions from the stringr package which provides a consistent interface for text processing. For this example, check out the help pages for str_replace and str_replace_all.

library(stringr)
?str_replace
?str_replace_all

The syntax is str_replace_all(string, pattern, replacement). In this case, the replacement is the empty string “” and the pattern must specify “$” and “,”. Placing the search characters inside square brackets, “[$,]”, instructs R to match either character.

library(stringr)

str_replace_all("$2,345,435.67", "[$,]","")
## [1] "2345435.67"

Fixing column names

Another common issue is that read.csv will coerce column headings to names that are syntactically valid. The rules are explained in the help to the function make.names. All invalid characters are translated to “.”.

The example below shows what happens if we read in data with the headings “Temp. (C)” and “DO (% sat.)”.



WQ.data = read.csv(header = TRUE, sep = ",", text = "
Temp. (C), DO (% sat.)
10, 85
20, 87
")

names(WQ.data)

[1] "Temp...C."   "DO....sat.."

There are way too many dots. In this case, we use regular expressions to:

  • replace more than one dot with a single dot
  • remove trailing dots.

x = names(WQ.data)

str_replace_all(x, c("[.]+" = "\\.", # find one or more dots, replace with a dot.
                     "[.]$" = "") )  # replace a trailing dot with nothing

[1] "Temp.C" "DO.sat" 

Finding the trailing dot requires the use of the meta-character “$”. In the previous example, the “$” was inside square brackets so it was treated literally. Outside square brackets it is an instruction to anchor the search pattern to the end of the target. The “+” meta-character means find one or more occurrences of the preceding pattern, and the “.” meta-character is a wildcard that matches any character except a newline. To treat a meta-character literally it must be preceded by two backslashes “\\” or placed inside square brackets. Meta-characters are summarized in Table 1.
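The difference between a literal and an anchoring “$” can be seen with a small made-up example (the vector x below is invented for illustration):

```r
library(stringr)

x <- c("price: $5", "5 dollars")

str_detect(x, "[$]")  # literal "$" inside brackets:  TRUE FALSE
str_detect(x, "\\$")  # literal "$" via escaping:     TRUE FALSE
str_detect(x, "5$")   # "5" anchored to string end:   TRUE FALSE
str_detect(x, "^5")   # "5" anchored to string start: FALSE TRUE
```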

Table 1: Meta-characters in regular expressions (adapted from Spector, 2008)

Meta-character Meaning
^ Anchors expression to beginning of target
$ Anchors expression to end of target
. Matches any single character except new line
\\ Treat the next meta-character literally
| Separates alternative patterns (“or”)
( ) Group patterns together
* Matches zero or more occurrences of preceding pattern
? Matches 0 or 1 occurrences of preceding pattern
+ Matches 1 or more occurrences of preceding pattern
{n} Matches exactly n occurrences of preceding pattern
{n,} Matches n or more occurrences of preceding pattern
{n,m} Matches between n and m occurrences of preceding pattern
\\s space
\\S no space
\\d any digit
\\D not a digit
[ ] character class brackets

Characters inside square brackets are referred to as a character class and are usually treated literally unless the first character in the list is a caret “^”, which means match any character not included in the character class. To search for a caret, ensure it is anywhere in the character class other than first.
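A short sketch of both rules, using invented strings:

```r
library(stringr)

# [^0-9] matches any character that is NOT a digit
str_detect(c("2019", "20a9"), "[^0-9]")  # FALSE TRUE

# a caret placed anywhere other than first is a literal "^"
str_extract("2^10", "[0-9^]+")           # "2^10"
```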

Table 2: Character classes in R

Character class Meaning
[0-9] or [[:digit:]] digits
[a-z] or [[:lower:]] lower case letters
[A-Z] or [[:upper:]] upper case letters
[a-zA-Z] or [[:alpha:]] alphabetic characters
[a-zA-Z0-9] or [[:alnum:]] alphanumeric characters
[^a-zA-Z] or [^[:alpha:]] non-alphabetic characters
[[:space:]] whitespace characters
\t \n tab, newline
[[:punct:]] punctuation characters
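Note that the POSIX class names must themselves sit inside a character class, hence the double brackets in practice. A quick sketch with a made-up string:

```r
library(stringr)

# remove all whitespace
str_replace_all("Flow = 36 m3/s", "[[:space:]]", "")    # "Flow=36m3/s"

# pull out each run of digits
str_extract_all("Flow = 36 m3/s", "[[:digit:]]+")[[1]]  # "36" "3"
```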

Some more examples


vec = c("< 2", "305 bar", "bar305", "other", "notIt 7", "12:23:34", "12:34") 

str_replace(vec, "\\s", "") # remove the first space (use str_replace_all to remove all spaces)

str_detect(vec, "<") # find strings that contain "<"

vec[str_detect(vec, "<")] # return strings in vec that contain "<"

str_subset(vec, "<") # return strings in vec that contain "<"

str_detect(vec, "[[:digit:]]") # strings that contain a digit

str_detect(vec, "[[:digit:]]{3}") # strings that contain 3 digits in a row

str_detect(vec, "^[[:digit:]]{3}") # strings that start with 3 digits

str_detect(vec, "[^(< 2)]") # strings containing any character not in the class
# note: this is not the same as "strings that aren't '< 2'";
# for that, use vec[vec != "< 2"]

str_subset(vec, '^3') # strings that start with 3

str_subset(vec, '^.{3}$') # strings that are three characters long

str_subset(vec, '^.{8,}$') # strings that are 8 or more characters long

str_subset(vec, '^.{2,5}$') # strings that are between 2 and 5 characters long

str_subset(vec, '(?:^.{3}$|^.{5}$)') # strings that are 3 or 5 characters long; (?: ) specifies a group that isn't captured

str_replace("Remove bracket and its contents (123)", pattern = "\\s\\(.+\\)", "")
# [1] "Remove bracket and its contents"
# Removed the space before the bracket as well

# find any records that only consist of spaces
str_detect(' ', '^[ ]+$') # TRUE
# The pattern '^[ ]+$' matches strings made up of one or more
# spaces and nothing else
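This is handy for dropping blank records from a vector. A sketch, using an invented vector (note the pattern below uses “*” so that completely empty strings are caught as well):

```r
library(stringr)

site <- c("Upper", "   ", "Lower", "")

# flag entries that are empty or contain only spaces
blank <- str_detect(site, "^[ ]*$")

site[!blank]  # "Upper" "Lower"
```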

Tagging

Tagging allows the extraction and replacement of specific parts of a string. For example, consider the string “area = 54 km2 flow = 36 m3/s rainfall = 45 mm”. To extract just the numerical value of the flow, i.e. 36, we need to use a regular expression that specifies the whole string and includes the part to be extracted in brackets. The bracketed sub-string is then referred to by a number preceded by backslashes. See the example below. Note that “.*” means zero or more of any character and “[^ ]” matches anything except a space.

vec = c("area = 54 km2 flow = 36 m3/s rainfall = 45 mm")

str_replace(vec, "^.*flow = ([^ ]+).*", "\\1")

[1] "36"

To extract two substrings, say the numeric value of flow and rainfall, the relevant parts of the string can be separately bracketed and tagged as follows.


str_replace(vec, "^.*flow = ([^ ]+).*rainfall = ([^ ]+).*", "\\1 \\2")

## [1] "36 45"

# Alternatively we could use str_extract_all
str_extract_all(vec, "\\d")[[1]] # extract digits
#[1] "5" "4" "2" "3" "6" "3" "4" "5"

str_extract_all(vec, "\\d+")[[1]] # extract 1 or more digits
#[1] "54" "2" "36" "3" "45"

str_extract_all(vec, "\\b\\d+")[[1]] # extract digits after a word boundary
#[1] "54" "36" "45"

Reporting censored values consistently

In water quality data, censored values (those above or below detection limits) are often reported using less-than or greater-than signs, e.g. <0.001, >24,000. Often the spacing is inconsistent, e.g. >24,000 or > 24,000, i.e. there could be one or more spaces before or after the ‘>’ sign.

We can clean these up with regex by capturing the sign and stripping the surrounding spaces:

str_replace(x, '^\\s*([<>])\\s*', '\\1')
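A worked sketch, with an invented vector of censored values (capturing the sign as group 1 preserves both ‘<’ and ‘>’, and avoids having to match the number itself, which may contain commas):

```r
library(stringr)

x <- c("<0.001", "> 24,000", " < 5")

# capture the sign, drop any spaces around it
str_replace(x, "^\\s*([<>])\\s*", "\\1")
# [1] "<0.001"  ">24,000" "<5"
```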


Contaminated data

A very common problem is that data you are trying to read in for analysis contains unexpected characters.  For example, see Table 4.

Table 4: Initial loss values

Catchment   Initial Loss (mm)
Rural 1     23
Rural 2     27
Urban 1     1.1
Urban 2     2.6 to 2.9

We were expecting initial loss to be a numeric but someone has input “2.6 to 2.9” as a value which R will interpret as a string.

One approach would be to adopt the average of 2.6 and 2.9 as the value to take forward for further analysis.  We use regex to extract the numerical values and then average them.

ConvertToNum = function(x){
  # extract runs of digits and decimal points from each element
  x1 = str_extract_all(x, '[\\d.]+')
  # average the numeric values found in each element
  sapply(x1, function(v) mean(as.numeric(v)))
}

# Example
x = c(23, 27, 1.1, "2.6 to 2.9")
ConvertToNum(x)
#[1] 23.00 27.00 1.10 2.75

Another approach, which is sometimes useful, is to use regex to build an R expression and then evaluate it.

x = "2.6 to 2.9"
expression.string <- str_replace(x, '(\\d+[.]\\d+)\\sto\\s(\\d+[.]\\d+)', '(\\1 + \\2)/2')

expression.string
#[1] "(2.6 + 2.9)/2"

eval(parse(text = expression.string))
#[1] 2.75


Lookaround

It is possible to use regular expressions to look around: match a pattern only when it is preceded or followed by some other pattern, without including that context in the match. Consider the problem of extracting the month part of a date, e.g. 14/1/2000 or 14/01/2000, i.e. 14 January 2000.

We are looking for 1 or 2 digits followed by a slash, then four digits.

library(stringr)

my.dates = c('14/1/2015')
str_extract(my.dates, '[0-9]{1,2}(?=/[0-9]{4})')

# "1"

my.dates = c('14/01/2015')
str_extract(my.dates, '[0-9]{1,2}(?=/[0-9]{4})')

# "01"

Thank you to Bryan Britten for the following: (?=/[0-9]{4}) is called a “positive lookahead”. The whole regular expression just says “I’m looking for 1-2 digits that are followed by a slash and then four digits”. str_extract will return only the 1-2 digits and not the pattern found in the lookahead.  For further details see http://www.regular-expressions.info/lookaround.html
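The mirror image is a lookbehind, (?<=...), which matches only when the pattern is preceded by the given context. A sketch extracting the year from the same date format:

```r
library(stringr)

# four digits, but only when preceded by a slash: the year
str_extract("14/01/2015", "(?<=/)[0-9]{4}")  # "2015"
```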

Back referencing and capture groups

There is an example in the RStudio blog about the release of stringr 1.1.0 that stumped me for a while.

str_subset(fruit, "(..)\\1")
#> [1] "banana"      "coconut"     "cucumber"    "jujube"      "papaya"     
#> [6] "salal berry"

(..) is a capture group and \\1 is a back reference to whatever the group matched. This finds any two characters that are immediately repeated, e.g. the ‘an’, ‘an’ in banana or the ‘cu’, ‘cu’ in cucumber. (fruit is a character vector of fruit names included with stringr.)
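The same ideas work on any vector, and back references in a replacement string let you reorder captured pieces. A sketch with invented examples:

```r
library(stringr)

v <- c("banana", "apple", "cucumber", "cherry")
str_subset(v, "(..)\\1")  # words with an immediately repeated pair
# [1] "banana"   "cucumber"

# reorder capture groups: ISO date to day/month/year
str_replace("2015-01-14", "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
# [1] "14/01/2015"
```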
