R for Data Science: Exercise Solutions

Question 1

In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

Answer

The function paste() separates strings with a space by default (sep = " "), while paste0() uses no separator.

paste("foo", "bar")
#> [1] "foo bar"
paste0("foo", "bar")
#> [1] "foobar"

Since str_c() does not separate strings with spaces by default it is closer in behavior to paste0().

str_c("foo", "bar")
#> [1] "foobar"

However, str_c() and the paste functions handle NA differently. The function str_c() propagates NA: if any argument is a missing value, it returns a missing value. This is in line with how numeric R functions, e.g. sum() and mean(), handle missing values. The paste functions, in contrast, convert NA to the string "NA" and then treat it like any other character vector.

str_c("foo", NA)
#> [1] NA
paste("foo", NA)
#> [1] "foo NA"
paste0("foo", NA)
#> [1] "fooNA"

Question 2

In your own words, describe the difference between the sep and collapse arguments to str_c().

Answer

The sep argument is the string inserted between the arguments to str_c(), while collapse is the string used to join the elements of the resulting character vector into a character vector of length one.
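
The difference is easiest to see side by side. The following is a small illustrative sketch; the vector x and the result names are made up for this example.

```r
library(stringr)

x <- c("a", "b", "c")

# sep is placed between the corresponding elements of each argument,
# so the result still has one element per input element.
with_sep <- str_c(x, "1", sep = "-")
with_sep
#> [1] "a-1" "b-1" "c-1"

# collapse joins the elements of the result into a single string.
collapsed <- str_c(x, collapse = ", ")
collapsed
#> [1] "a, b, c"

# Both can be combined: sep is applied first, then collapse.
both <- str_c(x, "1", sep = "-", collapse = ", ")
both
#> [1] "a-1, b-1, c-1"
```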

Question 3

Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

Answer

The following function extracts the middle character. If the string has an even number of characters the choice is arbitrary. We choose to select $\lceil n / 2 \rceil$, because that case works even if the string is only of length one. A more general method would allow the user to select either the floor or ceiling for the middle character of an even string.

x <- c("a", "abc", "abcd", "abcde", "abcdef")
L <- str_length(x)
m <- ceiling(L / 2)
str_sub(x, m, m)
#> [1] "a" "b" "b" "c" "c"
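
The more general method mentioned above could look something like the following sketch. The function name str_middle() and its side argument are hypothetical, chosen only for illustration: "left" picks the ceiling-based position used above, and "right" picks the other candidate for even-length strings.

```r
library(stringr)

# Hypothetical helper: extract the middle character, letting the caller
# choose which of the two middle characters to take for even lengths.
str_middle <- function(x, side = c("left", "right")) {
  side <- match.arg(side)
  n <- str_length(x)
  # For odd n both formulas give the same position, (n + 1) / 2.
  m <- if (side == "left") ceiling(n / 2) else floor(n / 2) + 1
  str_sub(x, m, m)
}

x <- c("a", "abc", "abcd", "abcde", "abcdef")
left <- str_middle(x, "left")
left
#> [1] "a" "b" "b" "c" "c"
right <- str_middle(x, "right")
right
#> [1] "a" "b" "c" "c" "d"
```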

Question 4

What does str_wrap() do? When might you want to use it?

Answer

The function str_wrap() wraps text so that it fits within a certain width. This is useful for wrapping long strings of text to be typeset.
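
As a brief illustration, the sketch below wraps a made-up sentence to a width of 30 characters and prints it. The exact line breaks are not reproduced here; the code only checks the line lengths.

```r
library(stringr)

# An arbitrary example sentence, invented for this sketch.
text <- "The quick brown fox jumps over the lazy dog and then keeps running through the field."

# str_wrap() inserts newlines so that each line fits the target width.
wrapped <- str_wrap(text, width = 30)
cat(wrapped, "\n")

# Split on the inserted newlines to inspect the individual lines.
lines <- str_split(wrapped, "\n")[[1]]
str_length(lines)
```

Since no word in this sentence is longer than the width, each resulting line should be at most 30 characters long.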

Question 5

What does str_trim() do? What’s the opposite of str_trim()?

Answer

The function str_trim() trims the whitespace from a string.

str_trim(" abc ")
#> [1] "abc"
str_trim(" abc ", side = "left")
#> [1] "abc "
str_trim(" abc ", side = "right")
#> [1] " abc"

The opposite of str_trim() is str_pad() which adds characters to each side.

str_pad("abc", 5, side = "both")
#> [1] " abc "
str_pad("abc", 4, side = "right")
#> [1] "abc "
str_pad("abc", 4, side = "left")
#> [1] " abc"

Question 6

Write a function that turns (e.g.) a vector c("a", "b", "c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.

Answer

See the chapter Functions for more details on writing R functions.

This function needs to handle four cases.

n == 0: an empty string, e.g. "".
n == 1: the original vector, e.g. "a".
n == 2: return the two elements separated by “and”, e.g. "a and b".
n > 2: return the first n - 1 elements separated by commas, and the last element separated by a comma and “and”, e.g. "a, b, and c".

str_commasep <- function(x, delim = ",") {
  n <- length(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # no comma before and when n == 2
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # commas after all n - 1 elements
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # prepend "and" to the last element
    last <- str_c("and", x[[n]], sep = " ")
    # combine parts with spaces
    str_c(c(not_last, last), collapse = " ")
  }
}
str_commasep(character(0))
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"

Question 7

Explain why each of these strings don’t match a \: "\", "\\", "\\\".

Answer

"\": This will escape the next character in the R string.
"\\": This will resolve to \ in the regular expression, which will escape the next character in the regular expression.
"\\\": The first two backslashes will resolve to a literal backslash in the regular expression, the third will escape the next character. So in the regular expression, this will escape some escaped character.

Question 8

How would you match the sequence "'\ ?

Answer

str_view("\"'\\", "\"'\\\\", match = TRUE)
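
To see why the pattern needs four backslashes, it can help to print both the string and the pattern with writeLines(), which shows their contents after R has processed the string escapes. This is only an illustrative check; the variable names are made up.

```r
library(stringr)

x <- "\"'\\"          # the three characters " ' \
pattern <- "\"'\\\\"  # the regular expression "'\\

writeLines(x)
#> "'\
writeLines(pattern)
#> "'\\

# The string has three characters, and the pattern matches it.
n_chars <- str_length(x)
n_chars
#> [1] 3
matched <- str_detect(x, pattern)
matched
#> [1] TRUE
```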

Question 9

What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

Answer

It will match any patterns that are a dot followed by any character, repeated three times.

str_view(c(".a.b.c", ".a.b", "....."), c("\\..\\..\\.."), match = TRUE)

Question 10

How would you match the literal string "$^$"?

Answer

To check that the pattern works, I’ll include both the string "$^$", and an example where that pattern occurs in the middle of the string which should not be matched.

str_view(c("$^$", "ab$^$sfas"), "^\\$\\^\\$$", match = TRUE)

Question 11

Given the corpus of common words in stringr::words, create regular expressions that find all words that:

Start with “y”.
End with “x”
Are exactly three letters long. (Don’t cheat by using str_length()!)
Have seven letters or more.

Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

Answer

The answer to each part follows.

The words that start with “y” are:
```
str_view(stringr::words, "^y", match = TRUE)
```
The words that end with “x” are:
```
str_view(stringr::words, "x$", match = TRUE)
```
The words that are exactly three letters long are:
```
str_view(stringr::words, "^...$", match = TRUE)
```
The words that have seven letters or more:
```
str_view(stringr::words, ".......", match = TRUE)
```
Since the pattern ....... is not anchored with either ^ or $, it will match any word with at least seven letters. The pattern ^.......$ matches words with exactly seven characters.

Question 12

Create regular expressions to find all words that:

Start with a vowel.
That only contain consonants. (Hint: thinking about matching “not”-vowels.)
End with ed, but not with eed.
End with ing or ise.
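
Answer

One possible set of patterns is sketched below. It is offered as a sketch rather than a definitive solution: the small test vector x is made up for illustration, and “y” is treated as a consonant.

```r
library(stringr)

# A small made-up vector that exercises each pattern.
x <- c("apple", "by", "bed", "feed", "ed", "king", "rise", "dog")

# Start with a vowel.
starts_vowel <- str_subset(x, "^[aeiou]")
starts_vowel
#> [1] "apple" "ed"

# Contain only consonants: one or more "not"-vowels from start to end.
only_consonants <- str_subset(x, "^[^aeiou]+$")
only_consonants
#> [1] "by"

# End with ed, but not with eed; the ^ alternative covers the word "ed".
ed_not_eed <- str_subset(x, "(^|[^e])ed$")
ed_not_eed
#> [1] "bed" "ed"

# End with ing or ise.
ing_or_ise <- str_subset(x, "i(ng|se)$")
ing_or_ise
#> [1] "king" "rise"
```

The same patterns can be passed to str_view(stringr::words, pattern, match = TRUE) to apply them to the full word list.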

Question 13

Empirically verify the rule “i before e except after c”.

Answer

length(str_subset(stringr::words, "(cei|[^c]ie)"))
#> [1] 14

length(str_subset(stringr::words, "(cie|[^c]ei)"))
#> [1] 3

Question 14

Is “q” always followed by a “u”?

Answer

In the stringr::words dataset, yes.

str_view(stringr::words, "q[^u]", match = TRUE)

In the English language, no. However, the examples are few, and mostly loanwords, such as “burqa” and “cinq”. Also, “qwerty”. That I had to add all of those examples to the list of words that spellchecking should ignore is indicative of their rarity.

Question 15

Write a regular expression that matches a word if it’s probably written in British English, not American English.

Answer

In the general case, this is hard, and could require a dictionary. But, there are a few heuristics to consider that would account for some common cases: British English tends to use the following:

“ou” instead of “o”
use of “ae” and “oe” instead of “a” and “o”
ends in ise instead of ize
ends in yse

The regex ou|ise$|ae|oe|yse$ would match these.

There are other spelling differences between American and British English but they are not patterns amenable to regular expressions. It would require a dictionary with differences in spellings for different words.
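
A quick check of the heuristic on a few hand-picked words shows both what it catches and where it goes wrong. The word lists below are made up for illustration.

```r
library(stringr)

british <- "ou|ise$|ae|oe|yse$"

# British and American spellings of the same words, interleaved.
pairs <- c("colour", "color", "organise", "organize", "analyse", "analyze")
pair_hits <- str_detect(pairs, british)
pair_hits
#> [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

# Ordinary words spelled the same way in both varieties still match,
# which is why this is only a rough heuristic.
false_positives <- str_detect(c("house", "does", "rise"), british)
false_positives
#> [1] TRUE TRUE TRUE
```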

Question 16

Create a regular expression that will match telephone numbers as commonly written in your country.

Answer

This answer can be improved and expanded.
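
As a starting point, here is a minimal sketch that assumes ten-digit numbers written in a common United States style, such as (555) 123-4567 or 555-123-4567. The pattern, the name us_phone, and the example numbers are illustrative assumptions, and the pattern does not validate area codes or handle country codes and extensions.

```r
library(stringr)

# Optional parentheses around the area code, then optional separators
# (space, dot, or hyphen) between the three groups of digits.
us_phone <- "^\\(?\\d{3}\\)?[ .\\-]?\\d{3}[ .\\-]?\\d{4}$"

numbers <- c("(555) 123-4567", "555-123-4567", "555.123.4567",
             "5551234567", "123-4567", "phone: none")
is_phone <- str_detect(numbers, us_phone)
is_phone
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
```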

Question 17

Describe the equivalents of ?, +, * in {m,n} form.

Answer

| Pattern | `{m,n}` | Meaning |
|---------|---------|---------|
| `?` | `{0,1}` | Match at most 1 |
| `+` | `{1,}` | Match 1 or more |
| `*` | `{0,}` | Match 0 or more |

For example, let’s repeat the examples in the chapter, replacing ? with {0,1}, + with {1,}, and * with {0,}.

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

str_view(x, "CC?")

str_view(x, "CC{0,1}")

str_view(x, "CC+")

str_view(x, "CC{1,}")

str_view_all(x, "C[LX]+")

str_view_all(x, "C[LX]{1,}")

The chapter does not contain an example of *. This pattern looks for a “C” optionally followed by any number of “L” or “X” characters.

str_view_all(x, "C[LX]*")

str_view_all(x, "C[LX]{0,}")
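
Because str_view() output is visual, the equivalences can also be checked programmatically. The sketch below compares the extracted matches for each pair of patterns; the result names are made up for this example.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# ? versus {0,1}
opt <- str_extract(x, c("CC?", "CC{0,1}"))
opt
#> [1] "CC" "CC"

# + versus {1,}
plus <- str_extract(x, c("CC+", "CC{1,}"))
plus
#> [1] "CCC" "CCC"

# * versus {0,}: every "C" matches, with or without trailing L or X.
star <- str_extract_all(x, "C[LX]*")[[1]]
star
#> [1] "C"     "C"     "CLXXX"
star_braces <- str_extract_all(x, "C[LX]{0,}")[[1]]
identical(star, star_braces)
#> [1] TRUE
```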

Question 18

Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)

^.*$
"\\{.+\\}"
\d{4}-\d{2}-\d{2}
"\\\\{4}"

Answer

The answer to each part follows.

^.*$ will match any string that does not contain a newline, since . does not match a newline by default. For example: ^.*$: c("dog", "$1.23", "lorem ipsum").
"\\{.+\\}" will match any string with curly braces surrounding at least one character. For example: "\\{.+\\}": c("{a}", "{abc}").
\d{4}-\d{2}-\d{2} will match four digits followed by a hyphen, followed by two digits followed by a hyphen, followed by another two digits. This is a regular expression that can match dates formatted like “YYYY-MM-DD” (“%Y-%m-%d”). For example: \d{4}-\d{2}-\d{2}: 2018-01-11
"\\\\{4}" is \\{4}, which will match four backslashes. For example: "\\\\{4}": "\\\\\\\\".

Question 19

Create regular expressions to find all words that:

Start with three consonants.
Have three or more vowels in a row.
Have two or more vowel-consonant pairs in a row.

Answer

The answer to each part follows.

This regex finds all words starting with three consonants.
```
str_view(words, "^[^aeiou]{3}", match = TRUE)
```
This regex finds three or more vowels in a row:
```
str_view(words, "[aeiou]{3,}", match = TRUE)
```
This regex finds two or more vowel-consonant pairs in a row.
```
str_view(words, "([aeiou][^aeiou]){2,}", match = TRUE)
```

Question 20

Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/

Answer

Exercise left to reader. That site validates its solutions, so they aren’t repeated here.

Question 21

Describe, in words, what these expressions will match:

(.)\1\1 :
"(.)(.)\\2\\1":
(..)\1:
"(.).\\1.\\1":
"(.)(.)(.).*\\3\\2\\1"

Answer

The answer to each part follows.

(.)\1\1: The same character appearing three times in a row. E.g. "aaa"
"(.)(.)\\2\\1": A pair of characters followed by the same pair of characters in reversed order. E.g. "abba".
(..)\1: Any two characters repeated. E.g. "a1a1".
"(.).\\1.\\1": A character followed by any character, the original character, any character, and the original character again. E.g. "abaca", "b8b.b".
"(.)(.)(.).*\\3\\2\\1": Three characters followed by zero or more characters of any kind followed by the same three characters but in reverse order. E.g. "abcsgasgddsadgsdgcba" or "abccba" or "abc1cba".

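The descriptions above can be checked against their examples with str_detect(). This is only an illustrative check; the counterexamples at the end are made up.

```r
library(stringr)

# Each example should be matched by its pattern.
matches <- c(
  str_detect("aaa", "(.)\\1\\1"),
  str_detect("abba", "(.)(.)\\2\\1"),
  str_detect("a1a1", "(..)\\1"),
  str_detect("abaca", "(.).\\1.\\1"),
  str_detect("abc1cba", "(.)(.)(.).*\\3\\2\\1")
)
matches
#> [1] TRUE TRUE TRUE TRUE TRUE

# Counterexamples: "abab" never repeats a pair in reversed order, and
# "abcabc" never repeats three characters in reverse.
non_matches <- c(
  str_detect("abab", "(.)(.)\\2\\1"),
  str_detect("abcabc", "(.)(.)(.).*\\3\\2\\1")
)
non_matches
#> [1] FALSE FALSE
```
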
Question 22

Construct regular expressions to match words that:

Start and end with the same character.
Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice.)
Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s.)

Answer

The answer to each part follows.

This regular expression matches words that start and end with the same character.

str_subset(words, "^(.)((.*\\1$)|\\1?$)")
#>  [1] "a"          "america"    "area"       "dad"        "dead"      
#>  [6] "depend"     "educate"    "else"       "encourage"  "engine"    
#> [11] "europe"     "evidence"   "example"    "excuse"     "exercise"  
#> [16] "expense"    "experience" "eye"        "health"     "high"      
#> [21] "knock"      "level"      "local"      "nation"     "non"       
#> [26] "rather"     "refer"      "remember"   "serious"    "stairs"    
#> [31] "test"       "tonight"    "transport"  "treat"      "trust"     
#> [36] "window"     "yesterday"

This regular expression will match any repeated pair of letters, where a letter is defined as one of the ASCII letters A-Z or a-z. First, check that it works with the example in the problem.
```
str_subset("church", "([A-Za-z][A-Za-z]).*\\1")
#> [1] "church"
```
Now, find all matching words in words.
```
str_subset(words, "([A-Za-z][A-Za-z]).*\\1")
#>  [1] "appropriate" "church"      "condition"   "decide"      "environment"
#>  [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
#> [11] "pressure"    "remember"    "represent"   "require"     "sense"      
#> [16] "therefore"   "understand"  "whether"
```
The \\1 pattern is called a backreference. It matches whatever the first group matched. This allows the pattern to match a repeating pair of letters without having to specify exactly which pair of letters is being repeated.

Note that these patterns are case sensitive. Use the case insensitive flag if you want to check for repeated pairs of letters with different capitalization.

This regex matches words that contain one letter repeated in at least three places. First, check that it works with the example given in the question.

str_subset("eleven", "([a-z]).*\\1.*\\1")
#> [1] "eleven"

Now, retrieve the matching words in words.

str_subset(words, "([a-z]).*\\1.*\\1")
#>  [1] "appropriate" "available"   "believe"     "between"     "business"   
#>  [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
#> [11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
#> [16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
#> [21] "therefore"   "tomorrow"

Question 23

For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.

Find all words that start or end with x.
Find all words that start with a vowel and end with a consonant.
Are there any words that contain at least one of each different vowel?

Answer

The answer to each part follows.

Words that start or end with x?

# one regex
words[str_detect(words, "^x|x$")]
#> [1] "box" "sex" "six" "tax"
# split regex into parts
start_with_x <- str_detect(words, "^x")
end_with_x <- str_detect(words, "x$")
words[start_with_x | end_with_x]
#> [1] "box" "sex" "six" "tax"

Words starting with vowel and ending with consonant.

str_subset(words, "^[aeiou].*[^aeiou]$") %>% head()
#> [1] "about"   "accept"  "account" "across"  "act"     "actual"
start_with_vowel <- str_detect(words, "^[aeiou]")
end_with_consonant <- str_detect(words, "[^aeiou]$")
words[start_with_vowel & end_with_consonant] %>% head()
#> [1] "about"   "accept"  "account" "across"  "act"     "actual"

There is no simple regular expression that matches words that contain at least one of each vowel. The regular expression would need to consider all possible orders in which the vowels could occur.

pattern <-
  cross(rerun(5, c("a", "e", "i", "o", "u")),
    .filter = function(...) {
      x <- as.character(unlist(list(...)))
      length(x) != length(unique(x))
    }
  ) %>%
  map_chr(~str_c(unlist(.x), collapse = ".*")) %>%
  str_c(collapse = "|")

To check that this pattern works, test it on a string that should match,

str_subset("aseiouds", pattern)
#> [1] "aseiouds"

and then apply it to words.

str_subset(words, pattern)
#> character(0)

Using multiple str_detect() calls, one pattern for each vowel, produces a much simpler and more readable answer.

words[str_detect(words, "a") &
  str_detect(words, "e") &
  str_detect(words, "i") &
  str_detect(words, "o") &
  str_detect(words, "u")]
#> character(0)

There appear to be none.

Question 24

What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

Answer

Several words are tied for the highest number of vowels:

vowels <- str_count(words, "[aeiou]")
words[which(vowels == max(vowels))]
#> [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
#> [6] "experience"  "individual"  "television"

The word with the highest proportion of vowels is

prop_vowels <- str_count(words, "[aeiou]") / str_length(words)
words[which(prop_vowels == max(prop_vowels))]
#> [1] "a"

Question 25

In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a color. Modify the regex to fix the problem.

Answer

This was the original color match pattern:

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")

It matches “flickered” because it matches “red”. The problem is that the previous pattern will match any word with the name of a color inside it. We want to only match colors in which the entire word is the name of the color. We can do this by adding a \b (to indicate a word boundary) before and after the pattern:

colour_match2 <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
colour_match2
#> [1] "\\b(red|orange|yellow|green|blue|purple)\\b"

more2 <- sentences[str_count(sentences, colour_match) > 1]

str_view_all(more2, colour_match2, match = TRUE)
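
The effect of the word boundaries can be seen on a few short strings. The test strings below are made up for illustration, and the names loose and strict are used here in place of colour_match and colour_match2.

```r
library(stringr)

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
loose <- str_c(colours, collapse = "|")
strict <- str_c("\\b(", loose, ")\\b")

x <- c("flickered", "a red car", "bluebell")

# Without boundaries, a colour name inside a longer word still matches.
loose_hits <- str_detect(x, loose)
loose_hits
#> [1] TRUE TRUE TRUE

# With \b on both sides, only whole-word colour names match.
strict_hits <- str_detect(x, strict)
strict_hits
#> [1] FALSE  TRUE FALSE
```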

Question 26

From the Harvard sentences data, extract:

The first word from each sentence.
All words ending in ing.
All plurals.

Answer

The answer to each part follows.

Finding the first word in each sentence requires defining what pattern constitutes a word. For the purposes of this question, I’ll consider a word any contiguous set of letters. Since str_extract() will extract the first match, if it is provided a regular expression for words, it will return the first word.
```
str_extract(sentences, "[A-Za-z]+") %>% head()
#> [1] "The"   "Glue"  "It"    "These" "Rice"  "The"
```
However, the third sentence begins with “It’s”. To catch this, I’ll change the regular expression to require the string to begin with a letter, but allow for a subsequent apostrophe.
```
str_extract(sentences, "[A-Za-z][A-Za-z']*") %>% head()
#> [1] "The"   "Glue"  "It's"  "These" "Rice"  "The"
```

This pattern finds all words ending in ing.

pattern <- "\\b[A-Za-z]+ing\\b"
sentences_with_ing <- str_detect(sentences, pattern)
unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern))) %>%
  head()
#> [1] "spring"  "evening" "morning" "winding" "living"  "king"

Finding all plurals cannot be correctly accomplished with regular expressions alone. Finding plural words would at least require morphological information about words in the language. See WordNet for a resource that would do that. However, identifying words that end in an “s” and with more than three characters, in order to remove “as”, “is”, “gas”, etc., is a reasonable heuristic.
```
unique(unlist(str_extract_all(sentences, "\\b[A-Za-z]{3,}s\\b"))) %>%
  head()
#> [1] "planks" "days"   "bowls"  "lemons" "makes"  "hogs"
```

Question 27

Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.

Answer

numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
sentences[str_detect(sentences, numword)] %>%
  str_extract(numword)
#>  [1] "seven books"   "two met"       "two factors"   "three lists"  
#>  [5] "seven is"      "two when"      "ten inches"    "one war"      
#>  [9] "one button"    "six minutes"   "ten years"     "two shares"   
#> [13] "two distinct"  "five cents"    "two pins"      "five robins"  
#> [17] "four kinds"    "three story"   "three inches"  "six comes"    
#> [21] "three batches" "two leaves"

Question 28

Find all contractions. Separate out the pieces before and after the apostrophe.

Answer

This is done in two steps. First, identify the contractions. Second, split each contraction on the apostrophe.

contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences[str_detect(sentences, contraction)] %>%
  str_extract(contraction) %>%
  str_split("'")
#> [[1]]
#> [1] "It" "s" 
#> 
#> [[2]]
#> [1] "man" "s"  
#> 
#> [[3]]
#> [1] "don" "t"  
#> 
#> [[4]]
#> [1] "store" "s"    
#> 
#> [[5]]
#> [1] "workmen" "s"      
#> 
#> [[6]]
#> [1] "Let" "s"  
#> 
#> [[7]]
#> [1] "sun" "s"  
#> 
#> [[8]]
#> [1] "child" "s"    
#> 
#> [[9]]
#> [1] "king" "s"   
#> 
#> [[10]]
#> [1] "It" "s" 
#> 
#> [[11]]
#> [1] "don" "t"  
#> 
#> [[12]]
#> [1] "queen" "s"    
#> 
#> [[13]]
#> [1] "don" "t"  
#> 
#> [[14]]
#> [1] "pirate" "s"     
#> 
#> [[15]]
#> [1] "neighbor" "s"

Question 29

Replace all forward slashes in a string with backslashes.

Answer

str_replace_all("past/present/future", "/", "\\\\")
#> [1] "past\\present\\future"
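
The printed result shows doubled backslashes because print() displays strings in escaped form. A short check with writeLines(), which shows the raw contents, confirms that each slash was replaced by a single backslash.

```r
library(stringr)

x <- str_replace_all("past/present/future", "/", "\\\\")

# writeLines() shows the string as it is, without escaping.
writeLines(x)
#> past\present\future

# The length is unchanged: each "/" became exactly one character.
n_before <- str_length("past/present/future")
n_after <- str_length(x)
c(n_before, n_after)
#> [1] 19 19
```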

Question 30

Implement a simple version of str_to_lower() using str_replace_all().

Answer

replacements <- c("A" = "a", "B" = "b", "C" = "c", "D" = "d", "E" = "e",
                  "F" = "f", "G" = "g", "H" = "h", "I" = "i", "J" = "j", 
                  "K" = "k", "L" = "l", "M" = "m", "N" = "n", "O" = "o", 
                  "P" = "p", "Q" = "q", "R" = "r", "S" = "s", "T" = "t", 
                  "U" = "u", "V" = "v", "W" = "w", "X" = "x", "Y" = "y", 
                  "Z" = "z")
lower_words <- str_replace_all(words, pattern = replacements)
head(lower_words)
#> [1] "a"        "able"     "about"    "absolute" "accept"   "account"
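
A mixed-case input makes the effect easier to see. The sketch below also builds the same named vector more compactly from the built-in constants LETTERS and letters; the example string is made up.

```r
library(stringr)

# Names are the patterns ("A", ..., "Z"); values are the replacements.
replacements <- setNames(letters, LETTERS)

lowered <- str_replace_all("Hello World, R4DS!", replacements)
lowered
#> [1] "hello world, r4ds!"

# For ASCII text this agrees with str_to_lower().
identical(lowered, str_to_lower("Hello World, R4DS!"))
#> [1] TRUE
```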

Question 31

Switch the first and last letters in words. Which of those strings are still words?

Answer

First, make a vector of all the words with first and last letters swapped,

swapped <- str_replace_all(words, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")

Next, find which of the swapped strings are also in the original list using the function intersect(),

intersect(swapped, words)
#>  [1] "a"          "america"    "area"       "dad"        "dead"      
#>  [6] "lead"       "read"       "depend"     "god"        "educate"   
#> [11] "else"       "encourage"  "engine"     "europe"     "evidence"  
#> [16] "example"    "excuse"     "exercise"   "expense"    "experience"
#> [21] "eye"        "dog"        "health"     "high"       "knock"     
#> [26] "deal"       "level"      "local"      "nation"     "on"        
#> [31] "non"        "no"         "rather"     "dear"       "refer"     
#> [36] "remember"   "serious"    "stairs"     "test"       "tonight"   
#> [41] "transport"  "treat"      "trust"      "window"     "yesterday"

Alternatively, the regex can be written using the POSIX character class for letter ([[:alpha:]]):

swapped2 <- str_replace_all(words, "^([[:alpha:]])(.*)([[:alpha:]])$", "\\3\\2\\1")
intersect(swapped2, words)
#>  [1] "a"          "america"    "area"       "dad"        "dead"      
#>  [6] "lead"       "read"       "depend"     "god"        "educate"   
#> [11] "else"       "encourage"  "engine"     "europe"     "evidence"  
#> [16] "example"    "excuse"     "exercise"   "expense"    "experience"
#> [21] "eye"        "dog"        "health"     "high"       "knock"     
#> [26] "deal"       "level"      "local"      "nation"     "on"        
#> [31] "non"        "no"         "rather"     "dear"       "refer"     
#> [36] "remember"   "serious"    "stairs"     "test"       "tonight"   
#> [41] "transport"  "treat"      "trust"      "window"     "yesterday"

Question 32

Split up a string like "apples, pears, and bananas" into individual components.

Answer

x <- c("apples, pears, and bananas")
str_split(x, ", +(and +)?")[[1]]
#> [1] "apples"  "pears"   "bananas"
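
The pattern above relies on a comma before “and”. If the list might be written without that final comma, one possible extension, offered as a sketch, is to add a second alternative that splits on a bare “and” surrounded by spaces.

```r
library(stringr)

# Either a comma (optionally followed by "and"), or "and" on its own.
splitter <- ", +(and +)?| +and +"

with_comma <- str_split("apples, pears, and bananas", splitter)[[1]]
with_comma
#> [1] "apples"  "pears"   "bananas"

without_comma <- str_split("apples, pears and bananas", splitter)[[1]]
without_comma
#> [1] "apples"  "pears"   "bananas"
```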

Question 33

Why is it better to split up by boundary("word") than " "?

Answer

Splitting by boundary("word") is a more sophisticated method to split a string into words. It recognizes non-space punctuation that splits words, and also removes punctuation while retaining internal non-letter characters that are parts of the word, e.g., “can’t”. See the ICU website for a description of the set of rules that are used to determine word boundaries.

Consider this sentence from the official Unicode Report on word boundaries,

sentence <- "The quick (“brown”) fox can’t jump 32.3 feet, right?"

Splitting the string on spaces groups the punctuation with the words,

str_split(sentence, " ")
#> [[1]]
#> [1] "The"       "quick"     "(“brown”)" "fox"       "can’t"     "jump"     
#> [7] "32.3"      "feet,"     "right?"

However, splitting the string using boundary("word") correctly removes punctuation, while not separating “32.3” and “can’t”,

str_split(sentence, boundary("word"))
#> [[1]]
#> [1] "The"   "quick" "brown" "fox"   "can’t" "jump"  "32.3"  "feet"  "right"

Question 34

What does splitting with an empty string ("") do? Experiment, and then read the documentation.

Answer

str_split("ab. cd|agt", "")[[1]]
#>  [1] "a" "b" "." " " "c" "d" "|" "a" "g" "t"

It splits the string into individual characters.

Question 35

How would you find all strings containing \ with regex() vs. with fixed()?

Answer

str_subset(c("a\\b", "ab"), "\\\\")
#> [1] "a\\b"
str_subset(c("a\\b", "ab"), fixed("\\"))
#> [1] "a\\b"

Question 36

What are the five most common words in sentences?

Answer

Using str_extract_all() with the argument boundary("word") will extract all words. The rest of the code uses dplyr functions to count words and find the most common words.

tibble(word = unlist(str_extract_all(sentences, boundary("word")))) %>%
  mutate(word = str_to_lower(word)) %>%
  count(word, sort = TRUE) %>%
  head(5)
#> # A tibble: 5 x 2
#>   word      n
#>   <chr> <int>
#> 1 the     751
#> 2 a       202
#> 3 of      132
#> 4 to      123
#> 5 and     118

Question 37

Find the stringi functions that:

Count the number of words.
Find duplicated strings.
Generate random text.

Answer

The answer to each part follows.

To count the number of words use stringi::stri_count_words(). This code counts the words in the first six sentences of sentences.
```
stri_count_words(head(sentences))
#> [1] 8 8 9 9 7 7
```

The stringi::stri_duplicated() function finds duplicate strings.

stri_duplicated(c("the", "brown", "cow", "jumped", "over",
                  "the", "lazy", "fox"))
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

The stringi package contains several functions beginning with stri_rand_* that generate random text. The function stringi::stri_rand_strings() generates random strings. The following code generates four random strings each of length five.

stri_rand_strings(4, 5)
#> [1] "5pb90" "SUHjl" "sA2JO" "CP3Oy"

The function stringi::stri_rand_shuffle() randomly shuffles the characters in the text.

stri_rand_shuffle("The brown fox jumped over the lazy cow.")
#> [1] "ot f.lween p   jzwoom xyucobhv daheerrT"

The function stringi::stri_rand_lipsum() generates lorem ipsum text. Lorem ipsum text is nonsense text often used as placeholder text in publishing. The following code generates one paragraph of placeholder text.

stri_rand_lipsum(1)
#> [1] "Lorem ipsum dolor sit amet, hac non metus cras nam vitae tempus proin, sed. Diam gravida viverra eros mauris, magna lacinia dui nullam. Arcu proin aenean fringilla sed sollicitudin hac neque, egestas condimentum massa, elementum vivamus. Odio eget litora molestie eget eros pulvinar ac. Vel nec nullam vivamus, sociosqu lectus varius eleifend. Vitae in. Conubia ut hac maximus amet, conubia sed. Volutpat vitae class cursus, elit mauris porta. Mauris lacus donec odio eget quam inceptos, ridiculus cursus, ad massa. Rhoncus hac aenean at id consectetur molestie vitae! Sed, primis mi dictum lacinia eros. Ligula, feugiat consequat ut vivamus ut morbi et. Dolor, eget eleifend nec magnis aliquam egestas. Sollicitudin venenatis et aptent rhoncus nisl platea ligula cum."

Question 38

How do you control the language that stri_sort() uses for sorting?

Answer

You can set a locale to use when sorting with either stri_sort(..., opts_collator=stri_opts_collator(locale = ...)) or stri_sort(..., locale = ...). In this example from the stri_sort() documentation, the sorted order of the character vector depends on the locale.

string1 <- c("hladny", "chladny")
stri_sort(string1, locale = "pl_PL")
#> [1] "chladny" "hladny"
stri_sort(string1, locale = "sk_SK")
#> [1] "hladny"  "chladny"

The output of stri_opts_collator() can also be passed to the opts_collator argument of stri_sort().

stri_sort(string1, opts_collator = stri_opts_collator(locale = "pl_PL"))
#> [1] "chladny" "hladny"
stri_sort(string1, opts_collator = stri_opts_collator(locale = "sk_SK"))
#> [1] "hladny"  "chladny"

The function stri_opts_collator() provides finer-grained control over how strings are sorted. In addition to setting the locale, it has options to customize how case, Unicode, accents, and numeric values are handled when comparing strings.

string2 <- c("number100", "number2")
stri_sort(string2)
#> [1] "number100" "number2"
stri_sort(string2, opts_collator = stri_opts_collator(numeric = TRUE))
#> [1] "number2"   "number100"
