14 Strings

14.1 Introduction

14.2 String basics

Exercise 14.2.1

In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

The function paste() separates strings by spaces by default, while paste0() does not separate strings with spaces by default.

Since str_c() does not separate strings with spaces by default it is closer in behavior to paste0().

However, str_c() and the paste functions handle NA differently. The function str_c() propagates NA: if any argument is a missing value, it returns a missing value. This is in line with how numeric R functions such as sum() and mean() handle missing values. The paste functions, by contrast, convert NA to the string "NA" and then treat it like any other string.
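The differences described above can be checked directly. This is a minimal sketch, assuming the stringr package is installed; the variable names are only illustrative.

```r
library(stringr)

# paste() inserts a space between its arguments; paste0() and str_c() do not.
with_space <- paste("foo", "bar")            # "foo bar"
no_space   <- paste0("foo", "bar")           # "foobar"
stringr_c  <- str_c("foo", "bar")            # "foobar"

# Missing values: str_c() propagates NA, the paste functions stringify it.
str_c_na  <- str_c("foo", NA_character_)     # NA
paste_na  <- paste("foo", NA_character_)     # "foo NA"
paste0_na <- paste0("foo", NA_character_)    # "fooNA"
```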

Exercise 14.2.2

In your own words, describe the difference between the sep and collapse arguments to str_c().

The sep argument is the string inserted between the arguments to str_c(), element by element, so the result is still a vector. The collapse argument is the string used to join the elements of that resulting vector into a character vector of length one.
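A small illustration of the two arguments, assuming stringr is attached; the input vectors x and y are made up for the example.

```r
library(stringr)

x <- c("a", "b", "c")
y <- c("1", "2", "3")

# sep joins the arguments element-wise; the result has length 3.
with_sep <- str_c(x, y, sep = "-")                        # "a-1" "b-2" "c-3"

# collapse then joins those three elements into a single string.
with_collapse <- str_c(x, y, sep = "-", collapse = ", ")  # "a-1, b-2, c-3"
```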

Exercise 14.2.3

Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

To extract the middle character, compute its position from str_length() and pass that position as both the start and end of str_sub(). If the string has an even number of characters the choice is arbitrary. We choose to select \(\lceil n / 2 \rceil\), because that case works even if the string is only of length one. A more general method would allow the user to select either the floor or ceiling for the middle character of an even-length string.
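One way to write this, sketched with a few made-up test strings (stringr assumed to be attached):

```r
library(stringr)

x <- c("a", "abc", "abcd", "abcde", "abcdef")
n <- str_length(x)
m <- ceiling(n / 2)          # position of the middle character

# Use the same position for start and end to get exactly one character.
middle <- str_sub(x, m, m)   # "a" "b" "b" "c" "c"
```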

Exercise 14.2.4

What does str_wrap() do? When might you want to use it?

The function str_wrap() wraps text so that it fits within a certain width. This is useful for wrapping long strings of text to be typeset.
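As a rough illustration, the sketch below wraps a made-up paragraph to a target width of 40 characters. The text and the width are arbitrary choices for the example.

```r
library(stringr)

# Build a long single-line string to wrap.
text <- str_c(rep("The quick brown fox jumps over the lazy dog.", 4),
              collapse = " ")

# str_wrap() inserts newlines so that lines fit the target width.
wrapped <- str_wrap(text, width = 40)
cat(wrapped, "\n")
```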

Exercise 14.2.5

What does str_trim() do? What’s the opposite of str_trim()?
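A brief sketch of one possible answer: str_trim() removes whitespace from the ends of a string, and str_pad() can be seen as its opposite because it adds characters to the ends. The example strings are made up.

```r
library(stringr)

padded <- "  abc  "

trimmed_both  <- str_trim(padded)                  # "abc"
trimmed_left  <- str_trim(padded, side = "left")   # "abc  "
trimmed_right <- str_trim(padded, side = "right")  # "  abc"

# str_pad() adds whitespace (by default) up to the requested width.
pad_both  <- str_pad("abc", width = 5, side = "both")   # " abc "
pad_right <- str_pad("abc", width = 4, side = "right")  # "abc "
pad_left  <- str_pad("abc", width = 4, side = "left")   # " abc"
```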

Exercise 14.2.6

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.
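One possible implementation is sketched below. The function name str_commasep and its delim argument are hypothetical, and the choices for lengths 0, 1, and 2 (empty string, the element itself, and "a and b") are assumptions about what is most useful.

```r
library(stringr)

# Hypothetical helper: join a character vector as "a, b, and c".
str_commasep <- function(x, delim = ",") {
  n <- length(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # No comma before "and" when there are only two elements.
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # Add the delimiter to every element except the last.
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # Prepend "and" to the last element.
    last <- str_c("and", x[[n]], sep = " ")
    str_c(c(not_last, last), collapse = " ")
  }
}

str_commasep(c("a", "b", "c"))   # "a, b, and c"
```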

14.3 Matching patterns with regular expressions

14.3.1 Basic matches

Exercise 14.3.1.1

Explain why each of these strings don’t match a \: "\", "\\", "\\\".

  • "\": This will escape the next character in the R string.
  • "\\": This will resolve to \ in the regular expression, which will escape the next character in the regular expression.
  • "\\\": The first two backslashes resolve to a single backslash in the regular expression, and the third escapes the next character in the R string. The regular expression therefore sees a backslash escaping some already-escaped character, not a literal backslash.
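For contrast, the pattern that does match a literal backslash needs four backslashes in the R string. A small check, with a made-up input string:

```r
library(stringr)

x <- "a\\b"          # three characters: a, backslash, b
writeLines(x)        # shows the string without R's escaping

# The R string "\\\\" is the regular expression \\ (a literal backslash).
pattern <- "\\\\"
has_backslash <- str_detect(x, pattern)
```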

Exercise 14.3.1.2

How would you match the sequence "'\ ?
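A sketch of one answer: the double quote and the backslash must be escaped in the R string, and the backslash must additionally be escaped in the regular expression. The variable names are illustrative.

```r
library(stringr)

x <- "\"'\\"             # the three characters  "  '  \
writeLines(x)

# Regex: "'\\  written as an R string.
pattern <- "\"'\\\\"
matched <- str_detect(x, pattern)
```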

Exercise 14.3.1.3

What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

It will match any string that contains a literal dot followed by any character, three times in a row, for example ".a.b.c".
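As an R string, each backslash in the regular expression has to be doubled. A quick check on a few made-up strings:

```r
library(stringr)

# The regex \..\..\.. written as an R string.
pattern <- "\\..\\..\\.."

x <- c(".a.b.c", ".a.b", "abcdef")
dot_matches <- str_detect(x, pattern)   # TRUE FALSE FALSE
```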

14.3.2 Anchors

Exercise 14.3.2.1

How would you match the literal string "$^$"?

To check that the pattern works, I’ll include both the string "$^$", and an example where that pattern occurs in the middle of the string which should not be matched.
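A possible pattern, with each special character escaped and the whole thing anchored at both ends. The second test string is made up to show a case that should not match.

```r
library(stringr)

x <- c("$^$", "ab$^$sfas")

# Regex ^\$\^\$$ : anchors at both ends, escaped $, ^, $ in between.
pattern <- "^\\$\\^\\$$"
literal_match <- str_detect(x, pattern)   # TRUE FALSE
```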

Exercise 14.3.2.2

Given the corpus of common words in stringr::words, create regular expressions that find all words that:

  1. Start with “y”.
  2. End with “x”
  3. Are exactly three letters long. (Don’t cheat by using str_length()!)
  4. Have seven letters or more.

Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

The answer to each part follows.

  1. The words that start with “y” are:

  2. The words that end with “x” are:

  3. The words that are exactly three letters long are:

  4. The words that have seven letters or more are:

    Since the pattern ....... is not anchored with either ^ or $, it will match any word with at least seven letters. The pattern ^.......$ matches words with exactly seven characters.
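The four patterns can be sketched with str_subset(), which returns the matching words directly. The result variable names are illustrative.

```r
library(stringr)

starts_y   <- str_subset(words, "^y")        # 1. start with "y"
ends_x     <- str_subset(words, "x$")        # 2. end with "x"
three_long <- str_subset(words, "^...$")     # 3. exactly three letters
seven_plus <- str_subset(words, ".......")   # 4. seven letters or more

starts_y
ends_x
head(three_long)
head(seven_plus)
```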

14.3.3 Character classes and alternatives

Exercise 14.3.3.1

Create regular expressions to find all words that:

  1. Start with a vowel.
  2. That only contain consonants. (Hint: thinking about matching “not”-vowels.)
  3. End with ed, but not with eed.
  4. End with ing or ise.

The answer to each part follows.

  1. Words starting with vowels

    str_subset(stringr::words, "^[aeiou]")
    #>   [1] "a"           "able"        "about"       "absolute"    "accept"     
    #>   [6] "account"     "achieve"     "across"      "act"         "active"     
    #>  [11] "actual"      "add"         "address"     "admit"       "advertise"  
    #>  [16] "affect"      "afford"      "after"       "afternoon"   "again"      
    #>  [21] "against"     "age"         "agent"       "ago"         "agree"      
    #>  [26] "air"         "all"         "allow"       "almost"      "along"      
    #>  [31] "already"     "alright"     "also"        "although"    "always"     
    #>  [36] "america"     "amount"      "and"         "another"     "answer"     
    #>  [41] "any"         "apart"       "apparent"    "appear"      "apply"      
    #>  [46] "appoint"     "approach"    "appropriate" "area"        "argue"      
    #>  [51] "arm"         "around"      "arrange"     "art"         "as"         
    #>  [56] "ask"         "associate"   "assume"      "at"          "attend"     
    #>  [61] "authority"   "available"   "aware"       "away"        "awful"      
    #>  [66] "each"        "early"       "east"        "easy"        "eat"        
    #>  [71] "economy"     "educate"     "effect"      "egg"         "eight"      
    #>  [76] "either"      "elect"       "electric"    "eleven"      "else"       
    #>  [81] "employ"      "encourage"   "end"         "engine"      "english"    
    #>  [86] "enjoy"       "enough"      "enter"       "environment" "equal"      
    #>  [91] "especial"    "europe"      "even"        "evening"     "ever"       
    #>  [96] "every"       "evidence"    "exact"       "example"     "except"     
    #> [101] "excuse"      "exercise"    "exist"       "expect"      "expense"    
    #> [106] "experience"  "explain"     "express"     "extra"       "eye"        
    #> [111] "idea"        "identify"    "if"          "imagine"     "important"  
    #> [116] "improve"     "in"          "include"     "income"      "increase"   
    #> [121] "indeed"      "individual"  "industry"    "inform"      "inside"     
    #> [126] "instead"     "insure"      "interest"    "into"        "introduce"  
    #> [131] "invest"      "involve"     "issue"       "it"          "item"       
    #> [136] "obvious"     "occasion"    "odd"         "of"          "off"        
    #> [141] "offer"       "office"      "often"       "okay"        "old"        
    #> [146] "on"          "once"        "one"         "only"        "open"       
    #> [151] "operate"     "opportunity" "oppose"      "or"          "order"      
    #> [156] "organize"    "original"    "other"       "otherwise"   "ought"      
    #> [161] "out"         "over"        "own"         "under"       "understand" 
    #> [166] "union"       "unit"        "unite"       "university"  "unless"     
    #> [171] "until"       "up"          "upon"        "use"         "usual"
  2. Words that contain only consonants: Use the negate argument of str_subset.

    Alternatively, using str_view() the consonant-only words are:

  3. Words that end with “-ed” but not ending in “-eed”.

    The pattern above will not match the word "ed". If we wanted to include that, we could include it as a special case.

  4. Words ending in ing or ise:
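Parts 2 to 4 can be sketched as follows. The negate argument assumes stringr 1.4.0 or later, and the variable names are illustrative.

```r
library(stringr)

# 2. Only consonants: either negate a vowel match ...
no_vowels <- str_subset(words, "[aeiou]", negate = TRUE)
# ... or require every character to be a non-vowel.
no_vowels_alt <- str_subset(words, "^[^aeiou]+$")

# 3. End with "ed" but not "eed" (does not match the bare word "ed").
ed_not_eed <- str_subset(words, "[^e]ed$")

# 4. End with "ing" or "ise".
ing_or_ise <- str_subset(words, "i(ng|se)$")

no_vowels
ed_not_eed
ing_or_ise
```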

Exercise 14.3.3.2

Empirically verify the rule “i” before e except after “c”.
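One way to approach this is to count the words that follow the rule and the words that break it. The sketch below first checks the two patterns on a small hand-picked vector (an assumption made only for illustration) and then applies them to stringr::words.

```r
library(stringr)

follows_rule <- "(cei|[^c]ie)"   # "ei" after c, or "ie" after a non-c
breaks_rule  <- "(cie|[^c]ei)"   # "ie" after c, or "ei" after a non-c

# Sanity check on a few hand-picked words.
toy <- c("believe", "receive", "science", "weird")
toy_follows <- str_detect(toy, follows_rule)   # TRUE TRUE FALSE FALSE
toy_breaks  <- str_detect(toy, breaks_rule)    # FALSE FALSE TRUE TRUE

# Compare the counts in the corpus.
n_follow <- length(str_subset(words, follows_rule))
n_break  <- length(str_subset(words, breaks_rule))
c(follow = n_follow, break_rule = n_break)
```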

Exercise 14.3.3.3

Is “q” always followed by a “u”?

In the stringr::words dataset, yes.

In the English language, no. However, the exceptions are few and are mostly loanwords, such as “burqa” and “cinq”, as well as “qwerty”. That I had to add all of those examples to the list of words that spellchecking should ignore is indicative of their rarity.
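The claim about stringr::words can be checked by looking for a “q” that is followed by something other than “u”, or that ends the word. This is a minimal sketch; the variable name is illustrative.

```r
library(stringr)

# "q" followed by a non-"u" character, or "q" at the end of the word.
q_without_u <- str_subset(words, "q([^u]|$)")
q_without_u   # expected to be empty for this corpus
```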

Exercise 14.3.3.4

Write a regular expression that matches a word if it’s probably written in British English, not American English.

In the general case, this is hard and could require a dictionary. However, a few heuristics account for some common cases. British English tends to use the following:

  • “ou” instead of “o”
  • use of “ae” and “oe” instead of “a” and “o”
  • ends in ise instead of ize
  • ends in yse

The regex ou|ise$|ae|oe|yse$ would match these.

There are other spelling differences between American and British English, but they are not patterns amenable to regular expressions. Handling them would require a dictionary of the words whose spellings differ.
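The heuristic can be tried on a few made-up spelling pairs. The last word shows that the pattern also produces false positives, which is why it is only a rough guess.

```r
library(stringr)

british <- "ou|ise$|ae|oe|yse$"

x <- c("colour", "color", "realise", "realize", "analyse", "analyze")
looks_british <- str_detect(x, british)   # TRUE FALSE TRUE FALSE TRUE FALSE

# False positive: "house" is spelled the same way in both varieties.
house_flagged <- str_detect("house", british)   # TRUE
```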

Exercise 14.3.3.5

Create a regular expression that will match telephone numbers as commonly written in your country.

This answer can be improved and expanded.

The answer to this will vary by country.

For the United States, phone numbers have a format like 123-456-7890 or (123)456-7890. The following regular expressions match the first form:

The following regular expressions match the second form:

These regular expressions can be simplified with the {m,n} quantifier introduced in the next section:
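A sketch of both forms, first written out digit by digit and then with the {m,n} quantifier. The test numbers are made up, and allowing optional whitespace after the closing parenthesis is an assumption.

```r
library(stringr)

x <- c("123-456-7890", "(123)456-7890", "(123) 456-7890", "1235-2351")

# First form: 123-456-7890
form1      <- "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d"
form1_short <- "\\d{3}-\\d{3}-\\d{4}"

# Second form: (123)456-7890, optionally with whitespace after ")"
form2      <- "\\(\\d\\d\\d\\)\\s*\\d\\d\\d-\\d\\d\\d\\d"
form2_short <- "\\(\\d{3}\\)\\s*\\d{3}-\\d{4}"

match1 <- str_detect(x, form1)   # TRUE FALSE FALSE FALSE
match2 <- str_detect(x, form2)   # FALSE TRUE TRUE FALSE
```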

Note that this pattern doesn’t account for phone numbers that are invalid due to an invalid area code. Nor does this pattern account for special numbers like 911. It also doesn’t parse a leading country code or an extension. See the Wikipedia page for the North American Numbering Plan for more information on the complexities of US phone numbers, and this Stack Overflow question for a discussion of using a regex for phone number validation. The R package dialr implements robust phone number parsing. Generally, for patterns like phone numbers or URLs it is better to use a dedicated package. It is easy to match the pattern for the most common cases, and doing so is useful for learning regular expressions, but in real applications there are often edge cases that are handled by dedicated packages.

14.3.4 Repetition

Exercise 14.3.4.1

Describe the equivalents of ?, +, * in {m,n} form.

| Pattern | {m,n} | Meaning |
|---------|-------|---------|
| ? | {0,1} | Match at most 1 |
| + | {1,} | Match 1 or more |
| * | {0,} | Match 0 or more |

For example, let’s repeat the examples in the chapter, replacing ? with {0,1}, + with {1,}, and * with {0,}.

The chapter does not contain an example of *. This pattern looks for a “C” optionally followed by any number of “L” or “X” characters.
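The equivalences can be checked on the Roman numeral string used in the chapter. This sketch uses str_extract() so that the results can be compared directly; the helper variable names are illustrative.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

q_mark  <- str_extract(x, "CC?")          # "CC"
q_brace <- str_extract(x, "CC{0,1}")

plus_mark  <- str_extract(x, "CC+")       # "CCC"
plus_brace <- str_extract(x, "CC{1,}")

lx_plus       <- str_extract(x, "C[LX]+")     # "CLXXX"
lx_plus_brace <- str_extract(x, "C[LX]{1,}")

star_mark  <- str_extract(x, "C[LX]*")    # "C"
star_brace <- str_extract(x, "C[LX]{0,}")
```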

Exercise 14.3.4.2

Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

  1. ^.*$
  2. "\\{.+\\}"
  3. \d{4}-\d{2}-\d{2}
  4. "\\\\{4}"

The answer to each part follows.

  1. ^.*$ will match any string. For example: ^.*$: c("dog", "$1.23", "lorem ipsum").

  2. "\\{.+\\}" will match any string with curly braces surrounding at least one character. For example: "\\{.+\\}": c("{a}", "{abc}").

  3. \d{4}-\d{2}-\d{2} will match four digits followed by a hyphen, followed by two digits followed by a hyphen, followed by another two digits. This is a regular expression that can match dates formatted like “YYYY-MM-DD” (“%Y-%m-%d”). For example: \d{4}-\d{2}-\d{2}: 2018-01-11

  4. "\\\\{4}" is \\{4}, which will match four backslashes. For example: "\\\\{4}": "\\\\\\\\".
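Each description can be checked against a few made-up strings. Patterns written as bare regular expressions in the question are converted to R strings here by doubling the backslashes.

```r
library(stringr)

# 1. ^.*$ matches any string, including the empty string.
any_string <- str_detect(c("dog", "$1.23", ""), "^.*$")              # TRUE TRUE TRUE

# 2. Curly braces around at least one character.
braces <- str_detect(c("{a}", "{abc}", "{}"), "\\{.+\\}")            # TRUE TRUE FALSE

# 3. Dates formatted like YYYY-MM-DD.
dates <- str_detect(c("2018-01-11", "18-1-11"), "\\d{4}-\\d{2}-\\d{2}")  # TRUE FALSE

# 4. Four backslashes in a row (the second input has only two).
slashes <- str_detect(c("\\\\\\\\", "\\\\"), "\\\\{4}")              # TRUE FALSE
```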

Exercise 14.3.4.3

Create regular expressions to find all words that:

  1. Start with three consonants.
  2. Have three or more vowels in a row.
  3. Have two or more vowel-consonant pairs in a row.

The answer to each part follows.

  1. This regex finds all words starting with three consonants.

  2. This regex finds three or more vowels in a row:

  3. This regex finds two or more vowel-consonant pairs in a row.
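The three patterns might look like the following sketch, where anything that is not one of a, e, i, o, u is treated as a consonant. The variable names are illustrative.

```r
library(stringr)

# 1. Start with three consonants.
three_consonants <- str_subset(words, "^[^aeiou]{3}")

# 2. Three or more vowels in a row.
three_vowels <- str_subset(words, "[aeiou]{3,}")

# 3. Two or more vowel-consonant pairs in a row.
vc_pairs <- str_subset(words, "([aeiou][^aeiou]){2,}")

three_consonants
three_vowels
head(vc_pairs)
```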

Exercise 14.3.4.4

Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/

Exercise left to reader. That site validates its solutions, so they aren’t repeated here.

14.3.5 Grouping and backreferences

Exercise 14.3.5.1

Describe, in words, what these expressions will match:

  1. (.)\1\1 :
  2. "(.)(.)\\2\\1":
  3. (..)\1:
  4. "(.).\\1.\\1":
  5. "(.)(.)(.).*\\3\\2\\1"

The answer to each part follows.

  1. (.)\1\1: The same character appearing three times in a row. E.g. "aaa"
  2. "(.)(.)\\2\\1": A pair of characters followed by the same pair of characters in reversed order. E.g. "abba".
  3. (..)\1: Any two characters repeated. E.g. "a1a1".
  4. "(.).\\1.\\1": A character followed by any character, the original character, any character, and the original character again. E.g. "abaca", "b8b.b".
  5. "(.)(.)(.).*\\3\\2\\1" Three characters followed by zero or more characters of any kind followed by the same three characters but in reverse order. E.g. "abcsgasgddsadgsdgcba" or "abccba" or "abc1cba".
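The examples given in the answers can be verified with str_detect(). Patterns shown as bare regular expressions are written as R strings here, so each backreference needs a doubled backslash.

```r
library(stringr)

m1 <- str_detect("aaa", "(.)\\1\\1")
m2 <- str_detect("abba", "(.)(.)\\2\\1")
m3 <- str_detect("a1a1", "(..)\\1")
m4 <- str_detect(c("abaca", "b8b.b"), "(.).\\1.\\1")
m5 <- str_detect(c("abccba", "abc1cba"), "(.)(.)(.).*\\3\\2\\1")

# Strings that should not match, for contrast.
n1 <- str_detect("aab", "(.)\\1\\1")
n2 <- str_detect("abab", "(.)(.)\\2\\1")
```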

Exercise 14.3.5.2

Construct regular expressions to match words that:

  1. Start and end with the same character.
  2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
  3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

The answer to each part follows.

  1. This regular expression matches words that start and end with the same character.

  2. This regular expression will match any pair of repeated letters, where letters is defined to be the ASCII letters A-Z. First, check that it works with the example in the problem.

    Now, find all matching words in words.

    The \\1 pattern is called a backreference. It matches whatever the first group matched. This allows the pattern to match a repeating pair of letters without having to specify exactly which pair of letters is being repeated.

    Note that these patterns are case sensitive. Use the case insensitive flag if you want to check for repeated pairs of letters with different capitalization.

  3. This regex matches words that contain one letter repeated in at least three places. First, check that it works with the example given in the question.

    Now, retrieve the matching words in words.
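The three answers could be written as in the sketch below. The first pattern includes an alternative so that one-letter words such as “a” also count as starting and ending with the same character; that choice, like the variable names, is an assumption.

```r
library(stringr)

# 1. Start and end with the same character.
same_ends <- str_subset(words, "^(.)((.*\\1$)|\\1?$)")

# 2. A repeated pair of ASCII letters.
pair_pattern <- "([A-Za-z][A-Za-z]).*\\1"
church_ok <- str_detect("church", pair_pattern)
repeated_pair <- str_subset(words, pair_pattern)

# 3. One letter repeated in at least three places.
triple_pattern <- "([a-z]).*\\1.*\\1"
eleven_ok <- str_detect("eleven", triple_pattern)
triple_letter <- str_subset(words, triple_pattern)

head(same_ends)
head(repeated_pair)
head(triple_letter)
```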

14.4 Tools

14.4.1 Detect matches

Exercise 14.4.1.1

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

  1. Find all words that start or end with x.
  2. Find all words that start with a vowel and end with a consonant.
  3. Are there any words that contain at least one of each different vowel?

The answer to each part follows.

  1. Words that start or end with x?

  2. Words starting with vowel and ending with consonant.

  3. There is not a simple regular expression to match words that contain at least one of each vowel. The regular expression would need to consider all possible orders in which the vowels could occur.

    To check that this pattern works, test it on a string that should match

    Using multiple str_detect() calls, one pattern for each vowel, produces a much simpler and readable answer.

    There appear to be none.
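The sketch below covers parts 1 and 2 with both approaches, and part 3 with the multiple str_detect() approach only, since the single-regex version would have to enumerate every ordering of the vowels. The variable names are illustrative.

```r
library(stringr)

# 1. Start or end with "x".
x_single <- str_subset(words, "^x|x$")
x_multi  <- words[str_detect(words, "^x") | str_detect(words, "x$")]

# 2. Start with a vowel and end with a consonant.
vc_single <- str_subset(words, "^[aeiou].*[^aeiou]$")
vc_multi  <- words[str_detect(words, "^[aeiou]") &
                     str_detect(words, "[^aeiou]$")]

# 3. Contain every vowel at least once: one str_detect() call per vowel.
all_vowels <- words[str_detect(words, "a") &
                      str_detect(words, "e") &
                      str_detect(words, "i") &
                      str_detect(words, "o") &
                      str_detect(words, "u")]

x_single
head(vc_single)
all_vowels
```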

Exercise 14.4.1.2

What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)
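A sketch of one approach: count the vowels with str_count(), and for the proportion divide by the word length from str_length(). Several words may tie for the highest count, so all of them are returned.

```r
library(stringr)

n_vowels <- str_count(words, "[aeiou]")

# Word(s) with the highest number of vowels.
most_vowels <- words[n_vowels == max(n_vowels)]

# Word(s) with the highest proportion of vowels; the denominator is
# the number of letters in the word.
prop_vowels <- n_vowels / str_length(words)
highest_prop <- words[prop_vowels == max(prop_vowels)]

most_vowels
highest_prop
```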

14.4.2 Extract matches

Exercise 14.4.2.1

In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a color. Modify the regex to fix the problem.

This was the original color match pattern:

It matches “flickered” because it matches “red”. The problem is that the previous pattern will match any word with the name of a color inside it. We want to only match colors in which the entire word is the name of the color. We can do this by adding a \b (to indicate a word boundary) before and after the pattern:
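The fix can be sketched as follows, using the colour vector from the chapter. The test sentences are made up for illustration.

```r
library(stringr)

colours <- c("red", "orange", "yellow", "green", "blue", "purple")

# Original pattern: matches a colour name anywhere, even inside a word.
colour_match <- str_c(colours, collapse = "|")

# Revised pattern: the colour name must be a whole word.
colour_match2 <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")

x <- c("The lamp flickered in the dark.", "The red barn stood alone.")
old_hits <- str_detect(x, colour_match)    # TRUE TRUE
new_hits <- str_detect(x, colour_match2)   # FALSE TRUE
```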

Exercise 14.4.2.2

From the Harvard sentences data, extract:

  1. The first word from each sentence.
  2. All words ending in ing.
  3. All plurals.

The answer to each part follows.

  1. Finding the first word in each sentence requires defining what pattern constitutes a word. For the purposes of this question, I’ll consider a word to be any contiguous set of letters. Since str_extract() extracts the first match, providing it with a regular expression for words returns the first word.

    However, the third sentence begins with “It’s”. To catch this, I’ll change the regular expression to require the string to begin with a letter, but allow for a subsequent apostrophe.

  2. This pattern finds all words ending in ing.

  3. Finding all plurals cannot be correctly accomplished with regular expressions alone. Finding plural words would at least require morphological information about words in the language. See WordNet for a resource that would do that. However, identifying words that end in an “s” and with more than three characters, in order to remove “as”, “is”, “gas”, etc., is a reasonable heuristic.
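The three parts could be sketched like this. The plural pattern is only the heuristic described above (words of four or more letters ending in “s”), so it will include some words that are not plurals.

```r
library(stringr)

# 1. First word: a letter followed by letters or apostrophes.
first_word <- str_extract(sentences, "[A-Za-z][A-Za-z']*")
head(first_word)

# 2. Words ending in "ing".
ing_pattern <- "\\b[A-Za-z]+ing\\b"
ing_words <- unique(unlist(str_extract_all(sentences, ing_pattern)))
head(ing_words)

# 3. Heuristic plurals: at least three letters followed by a final "s".
plural_pattern <- "\\b[A-Za-z]{3,}s\\b"
plural_candidates <- unique(unlist(str_extract_all(sentences, plural_pattern)))
head(plural_candidates)
```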

14.4.4 Replacing matches

Exercise 14.4.4.1

Replace all forward slashes in a string with backslashes.
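A minimal sketch using str_replace_all(). The replacement needs four backslashes in the R string because a backslash is special both in R strings and in stringr replacement strings. The input string is made up.

```r
library(stringr)

x <- "past/present/future"

# "\\\\" in R source is two backslash characters, which the replacement
# syntax interprets as one literal backslash.
replaced <- str_replace_all(x, "/", "\\\\")
writeLines(replaced)
```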

14.4.5 Splitting

Exercise 14.4.5.1

Split up a string like "apples, pears, and bananas" into individual components.
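One possible pattern splits on a comma, following spaces, and an optional “and”. This is a sketch for strings shaped like the example; other list styles would need a different pattern.

```r
library(stringr)

x <- "apples, pears, and bananas"

# A comma, one or more spaces, then optionally "and" plus spaces.
pieces <- str_split(x, ", +(and +)?")[[1]]
pieces   # "apples" "pears" "bananas"
```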

Exercise 14.4.5.2

Why is it better to split up by boundary("word") than " "?

Splitting by boundary("word") is a more sophisticated method to split a string into words. It recognizes non-space punctuation that splits words, and also removes punctuation while retaining internal non-letter characters that are parts of the word, e.g., “can’t”. See the ICU website for a description of the set of rules that are used to determine word boundaries.

Consider this sentence from the official Unicode Report on word boundaries:

Splitting the string on spaces groups the punctuation with the words:

However, splitting the string using boundary("word") correctly removes punctuation, while not separating “32.3” and “can’t”:
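The comparison can be sketched with an ASCII rendering of that sentence (plain quotes and apostrophe, which is an assumption made to keep the example simple).

```r
library(stringr)

sentence <- "The quick (brown) fox can't jump 32.3 feet, right?"

# Splitting on spaces keeps punctuation attached to the words.
by_space <- str_split(sentence, " ")[[1]]
by_space

# Splitting on word boundaries drops punctuation but keeps
# "can't" and "32.3" intact.
by_word <- str_split(sentence, boundary("word"))[[1]]
by_word
```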

Exercise 14.4.5.3

What does splitting with an empty string ("") do? Experiment, and then read the documentation.

It splits the string into individual characters.
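A quick experiment with a made-up string shows the behaviour:

```r
library(stringr)

# An empty pattern splits between every character.
chars <- str_split("ab. cd|agt", "")[[1]]
chars   # "a" "b" "." " " "c" "d" "|" "a" "g" "t"
```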

14.4.6 Find matches

No exercises

14.5 Other types of pattern

Exercise 14.5.1

How would you find all strings containing \ with regex() vs. with fixed()?
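A sketch of both approaches on a made-up vector: with regex() the backslash must be escaped for the regular expression as well as for the R string, while fixed() only needs the R string escape.

```r
library(stringr)

x <- c("a\\b", "ab")   # only the first element contains a backslash

with_regex <- str_subset(x, "\\\\")         # regex \\ : four backslashes in R
with_fixed <- str_subset(x, fixed("\\"))    # literal \ : two backslashes in R
```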

Exercise 14.5.2

What are the five most common words in sentences?

Using str_extract_all() with the argument boundary("word") will extract all words. The rest of the code uses dplyr functions to count words and find the most common words.
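The approach described above might look like the following sketch. It assumes the dplyr package is installed, and it lower-cases the words before counting so that “The” and “the” are treated as the same word; that normalization is a choice, not a requirement of the exercise.

```r
library(stringr)
library(dplyr)

top_words <- tibble(word = unlist(str_extract_all(sentences, boundary("word")))) %>%
  mutate(word = str_to_lower(word)) %>%   # treat "The" and "the" alike
  count(word, sort = TRUE) %>%            # most frequent first
  head(5)

top_words
```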

14.6 Other uses of regular expressions

No exercises

14.7 stringi

Exercise 14.7.1

Find the stringi functions that:

  1. Count the number of words.
  2. Find duplicated strings.
  3. Generate random text.

The answer to each part follows.

  1. To count the number of words use stringi::stri_count_words(). This code counts the words in the first five sentences of sentences.

  2. The stringi::stri_duplicated() function finds duplicate strings.

  3. The stringi package contains several functions beginning with stri_rand_* that generate random text. The function stringi::stri_rand_strings() generates random strings. The following code generates four random strings each of length five.

    The function stringi::stri_rand_shuffle() randomly shuffles the characters in the text.

    The function stringi::stri_rand_lipsum() generates lorem ipsum text. Lorem ipsum text is nonsense text often used as placeholder text in publishing. The following code generates one paragraph of placeholder text.
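The functions mentioned in the three parts can be tried as follows. The input vectors are made up, and the random functions will return different text on each run.

```r
library(stringi)

# 1. Count words in the first five sentences of stringr::sentences.
word_counts <- stri_count_words(head(stringr::sentences, 5))

# 2. Flag elements that duplicate an earlier element.
dups <- stri_duplicated(c("the", "brown", "cow", "jumped", "over",
                          "the", "lazy", "fox"))

# 3. Random text: four random strings of length five, a shuffled
#    string, and one paragraph of lorem ipsum.
rand_strings <- stri_rand_strings(4, 5)
shuffled <- stri_rand_shuffle("The brown fox jumped over the lazy cow.")
lipsum <- stri_rand_lipsum(1)
```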

Exercise 14.7.2

How do you control the language that stri_sort() uses for sorting?

You can set a locale to use when sorting with either stri_sort(..., opts_collator=stri_opts_collator(locale = ...)) or stri_sort(..., locale = ...). In this example from the stri_sort() documentation, the sorted order of the character vector depends on the locale.
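A sketch based on that documentation example is shown below. In Slovak, “ch” is treated as a digraph that sorts after “h”, so the two locales would typically order these words differently; the exact result depends on the ICU data available on the system.

```r
library(stringi)

string1 <- c("hladny", "chladny")

# Polish collation rules.
sorted_pl <- stri_sort(string1, locale = "pl_PL")

# Slovak collation rules, specified in two equivalent ways.
sorted_sk <- stri_sort(string1, locale = "sk_SK")
sorted_sk_opts <- stri_sort(string1,
                            opts_collator = stri_opts_collator(locale = "sk_SK"))

sorted_pl
sorted_sk
```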

The output of stri_opts_collator() can also be used for the locale argument of str_sort.

The function stri_opts_collator() provides finer-grained control over how strings are sorted. In addition to setting the locale, it has options to customize how case, Unicode, accents, and numeric values are handled when comparing strings.