20 Vectors

20.1 Introduction

20.2 Vector basics

No exercises

20.3 Important types of atomic vector

Exercise 20.3.1

Describe the difference between is.finite(x) and !is.infinite(x).

To find out, try the functions on a numeric vector that includes at least one number and the four special values (NA, NaN, Inf, -Inf).

The is.finite() function considers non-missing numeric values to be finite, and considers missing (NA), not a number (NaN), positive infinity (Inf), and negative infinity (-Inf) to not be finite. The is.infinite() function behaves slightly differently. It considers Inf and -Inf to be infinite, and everything else, including non-missing numbers, NA, and NaN, to not be infinite. See Table 20.1.

Table 20.1: Results of is.finite() and is.infinite() for numeric and special values.
Value  is.finite()  is.infinite()
1      TRUE         FALSE
NA     FALSE        FALSE
NaN    FALSE        FALSE
Inf    FALSE        TRUE
-Inf   FALSE        TRUE
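
The comparison in Table 20.1 can be reproduced with a short check. This is a minimal sketch; the vector x is my own choice of test values.

```r
# One ordinary number plus the four special values
x <- c(1, NA, NaN, Inf, -Inf)

finite <- is.finite(x)         # TRUE only for the ordinary number
not_infinite <- !is.infinite(x) # TRUE for the number, NA, and NaN

finite
not_infinite
```

The two results differ only for NA and NaN, which are neither finite nor infinite.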

Exercise 20.3.2

Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work?

The source for dplyr::near() is a one-line function that compares abs(x - y) with a tolerance.

Instead of checking for exact equality, it checks that two numbers are within a certain tolerance, tol. By default the tolerance is set to the square root of .Machine$double.eps, where .Machine$double.eps is the smallest positive floating point number x such that 1 + x != 1.
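
To illustrate the idea without depending on dplyr, here is a minimal sketch of a tolerance comparison. The name near_sketch() is hypothetical; it mirrors the logic described above rather than reproducing the package source verbatim.

```r
# Hypothetical helper: TRUE when x and y differ by less than tol
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

# Floating point error makes the exact comparison fail
exact <- sqrt(2)^2 == 2
# The tolerance comparison treats the two values as equal
approx <- near_sketch(sqrt(2)^2, 2)

exact
approx
```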

Exercise 20.3.3

A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use Google to do some research.

For integer vectors, R uses a 32-bit representation. This means that it can represent up to \(2^{32}\) different values with integers. One of these values is set aside for NA_integer_. From the help for integer:

Note that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.

The range of integer values that R can represent in an integer vector is \(\pm (2^{31} - 1)\).

The maximum integer is \(2^{31} - 1\) rather than \(2^{32}\) because one bit is used to represent the sign (\(+\), \(-\)) and one value is used to represent NA_integer_.

If integer arithmetic produces a value greater than that maximum, R returns NA with a warning.

However, you can represent that value (exactly) with a numeric vector at the cost of about two times the memory.

The same is true for the negative of the integer max.
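
The limits described above can be checked with .Machine$integer.max. This is a small sketch; the variable names are my own, and the overflow warnings are suppressed only to keep the output short.

```r
int_max <- .Machine$integer.max

# Integer arithmetic past the maximum overflows to NA (with a warning)
over <- suppressWarnings(int_max + 1L)
# The same holds below the negative of the maximum
under <- suppressWarnings(-int_max - 1L)

# A double can hold the next value exactly
as_double <- as.numeric(int_max) + 1

int_max
over
under
as_double
```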

For double vectors, R uses a 64-bit representation. This means that they can represent at most \(2^{64}\) distinct values. However, some of those values are allocated to special values such as -Inf, Inf, NA_real_, and NaN. From the help for double:

All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308. It also has special values NaN (many of them), plus and minus infinity and plus and minus zero (although R acts as if these are the same). There are also denormal(ized) (or subnormal) numbers with absolute values above or below the range given above but represented to less precision.

The details of floating point representation and arithmetic are complicated, beyond the scope of this question, and better discussed in the references provided below. The double can represent numbers in the range of about \(\pm 2 \times 10^{308}\); the upper bound is stored in .Machine$double.xmax.

Many other details for the implementation of the double vectors are given in the .Machine variable (and its documentation). These include the base (radix) of doubles (double.base), the number of bits used for the significand, or mantissa (double.digits), the number of bits used in the exponent (double.exponent), and the smallest positive normalized number (double.xmin).
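
The corresponding elements of .Machine can be inspected directly. The values in the comments are the ones I would expect on a typical IEEE 754 platform; treat them as an assumption rather than a guarantee.

```r
.Machine$double.base      # radix of the representation, typically 2
.Machine$double.digits    # bits in the significand, typically 53
.Machine$double.exponent  # bits in the exponent, typically 11
.Machine$double.xmin      # smallest positive normalized double
.Machine$double.xmax      # largest finite double
```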

Exercise 20.3.4

Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise.

The functions that convert a double to an integer differ in how they deal with the fractional part of the double. There are a variety of rules that could be used to do this.

  • Round down, towards \(-\infty\). This is also called taking the floor of a number. This is the method the floor() function uses.

  • Round up, towards \(+\infty\). This is also called taking the ceiling. This is the method the ceiling() function uses.

  • Round towards zero. This is the method that the trunc() and as.integer() functions use.

  • Round away from zero.

  • Round to the nearest integer. There are several different methods for handling ties, defined as numbers with a fractional part of 0.5.

    • Round half down, towards \(-\infty\).
    • Round half up, towards \(+\infty\).
    • Round half towards zero
    • Round half away from zero
    • Round half towards the even integer. This is the method that the round() function uses.
    • Round half towards the odd integer.
# The name round_by_method is illustrative; the function needs a name to be called.
round_by_method <- function(x, method) {
  if (method == "round down") {
    floor(x)
  } else if (method == "round up") {
    ceiling(x)
  } else if (method == "round towards zero") {
    trunc(x)
  } else if (method == "round away from zero") {
    sign(x) * ceiling(abs(x))
  } else if (method == "nearest, round half up") {
    floor(x + 0.5)
  } else if (method == "nearest, round half down") {
    ceiling(x - 0.5)
  } else if (method == "nearest, round half towards zero") {
    sign(x) * ceiling(abs(x) - 0.5)
  } else if (method == "nearest, round half away from zero") {
    sign(x) * floor(abs(x) + 0.5)
  } else if (method == "nearest, round half to even") {
    round(x, digits = 0)
  } else if (method == "nearest, round half to odd") {
    dplyr::case_when(
      # smaller integer is odd - round half down
      floor(x) %% 2 == 1 ~ ceiling(x - 0.5),
      # otherwise, round half up
      TRUE ~ floor(x + 0.5)
    )
  } else if (method == "nearest, round half randomly") {
    # pick, independently for each element, whether a tie goes up or down
    round_half_up <- sample(c(TRUE, FALSE), length(x), replace = TRUE)
    y <- x
    y[round_half_up] <- floor(x[round_half_up] + 0.5)
    y[!round_half_up] <- ceiling(x[!round_half_up] - 0.5)
    y
  } else {
    stop("Unknown rounding method: ", method)
  }
}

See the Wikipedia articles, Rounding and IEEE floating point for more discussion of these rounding rules.

For rounding, R and many programming languages use the IEEE standard. This method is called “round to nearest, ties to even.”[8] This rule rounds ties, numbers with a fractional part of 0.5, to the nearest even number. In this rule, half the ties are rounded up, and half are rounded down. The following function, round2(), manually implements the “round to nearest, ties to even” method.
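
A minimal sketch of round2() follows. It is one possible way to write the function, not a canonical implementation, and it assumes that ties can be detected by testing whether the fractional part is exactly 0.5, which holds for the small values used here.

```r
round2 <- function(x, digits = 0) {
  m <- 10^digits
  y <- x * m
  lower <- floor(y)   # nearest integer below (or equal to) y
  frac <- y - lower   # fractional part, in [0, 1)
  rounded <- ifelse(
    frac < 0.5, lower,
    ifelse(
      frac > 0.5, lower + 1,
      # tie: choose whichever neighbor is even
      ifelse(lower %% 2 == 0, lower, lower + 1)
    )
  )
  rounded / m
}

x <- c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 1.2, 1.7)
round2(x)
round(x)
```

For these inputs, round2() and the built-in round() agree.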

This rounding method may be different than the one you learned in grade school, which, at least for me, was to always round ties upwards or, alternatively, away from zero. Rounding ties upwards is called the “round half up” rule. The problem with the “round half up” rule is that it is biased upwards for positive numbers. Rounding to nearest with ties towards even is not. Consider a sequence of ties that sums to zero.

A nice property of a rounding rule would be to preserve that sum. Using “ties towards even”, the sum is still zero. However, “ties towards \(+\infty\)” produces a non-zero number.
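
The bias can be seen with a short example. The particular sequence is my own choice: twenty ties, symmetric around zero.

```r
# Twenty ties from -9.5 to 9.5, which sum to zero
x <- seq(-9.5, 9.5, by = 1)

sum_original <- sum(x)
# round() uses ties to even, so the rounding errors cancel
sum_even <- sum(round(x))
# round half up shifts every tie upwards by 0.5
sum_half_up <- sum(floor(x + 0.5))

sum_original
sum_even
sum_half_up
```

Each of the twenty ties moves up by 0.5 under “round half up”, so the rounded sum is off by 10.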

Rounding rules can have real world impacts. One notable example was that in 1983, the Vancouver stock exchange adjusted its index from 524.811 to 1098.892 to correct for accumulated error due to rounding to three decimal places (see Vancouver Stock Exchange). This site lists several more examples of the dangers of rounding rules.

Exercise 20.3.5

What functions from the readr package allow you to turn a string into logical, integer, and double vector?

The function parse_logical() parses logical values, which can appear as variations of TRUE/FALSE or 1/0.

The function parse_integer() parses integer values.

However, if there are any non-numeric characters in the string, including currency symbols, commas, and decimals, parse_integer() will fail to parse the value: it returns NA and reports the parsing failure in a warning.

The function parse_number() parses numeric values. Unlike parse_integer(), the function parse_number() is more forgiving about the format of the numbers. It ignores all non-numeric characters before or after the first number, as in "$1,000.00". Within the number, parse_number() will only ignore grouping marks such as ",". This allows it to easily parse numeric fields that include currency symbols and comma separators in number strings without any intervention by the user.
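
A few calls illustrate the three parsers. This sketch assumes the readr package is installed; the input strings are my own examples.

```r
library(readr)

# Logical values written in several ways
lgl <- parse_logical(c("TRUE", "FALSE", "1", "0", "true", "t", "NA"))

# Strings that contain only digits parse as integers
int <- parse_integer(c("1235", "0134", "NA"))

# A grouping mark makes parse_integer() fail for that element
bad_int <- suppressWarnings(parse_integer("1,000"))

# parse_number() skips the currency symbol and the grouping mark
num <- parse_number(c("1.0", "3.5", "$1,000.00", "NA"))

lgl
int
bad_int
num
```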

20.4 Using atomic vectors

Exercise 20.4.1

What does mean(is.na(x)) tell you about a vector x? What about sum(!is.finite(x))?

I’ll use the numeric vector x to compare the behaviors of is.na() and is.finite(). It contains numbers (-1, 0, 1) as well as all the special numeric values: infinity (Inf), missing (NA), and not-a-number (NaN).

The expression mean(is.na(x)) calculates the proportion of missing (NA) and not-a-number (NaN) values in a vector:

The result of 0.286 is equal to 2 / 7 as expected. There are seven elements in the vector x, and two elements that are either NA or NaN.

The expression sum(!is.finite(x)) calculates the number of elements in the vector that are equal to missing (NA), not-a-number (NaN), or infinity (Inf or -Inf).
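
Both expressions can be checked on a vector matching the description above. The exact ordering of the elements is an assumption on my part.

```r
# Seven elements: three numbers, both infinities, NA, and NaN
x <- c(-Inf, -1, 0, 1, Inf, NA, NaN)

# Proportion of elements that are NA or NaN
prop_missing <- mean(is.na(x))
# Count of elements that are NA, NaN, Inf, or -Inf
n_not_finite <- sum(!is.finite(x))

prop_missing
n_not_finite
```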

Review the Numeric section for the differences between is.na() and is.finite().

Exercise 20.4.2

Carefully read the documentation of is.vector(). What does it actually test for? Why does is.atomic() not agree with the definition of atomic vectors above?

The function is.vector() only checks whether the object has no attributes other than names. Thus a list is a vector:

But any object that has an attribute (other than names) is not:

The idea behind this is that object oriented classes will include attributes, including, but not limited to "class".

The function is.atomic() explicitly checks whether an object is one of the atomic types (“logical”, “integer”, “double”, “complex”, “character”, and “raw”). In versions of R before 4.4.0, it also returns TRUE for NULL.

The function is.atomic() will consider objects to be atomic even if they have extra attributes.
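
The behavior of both functions can be checked with a few calls. This is a minimal sketch; the attribute name "something" is arbitrary.

```r
# A list with only names is a vector according to is.vector()
list_is_vector <- is.vector(list(a = 1, b = 2))

# Adding any attribute other than names changes the answer
x <- 1:10
attr(x, "something") <- TRUE
attr_is_vector <- is.vector(x)

# is.atomic() ignores the extra attribute, but rejects lists
attr_is_atomic <- is.atomic(x)
list_is_atomic <- is.atomic(list(a = 1))

list_is_vector
attr_is_vector
attr_is_atomic
list_is_atomic
```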

Exercise 20.4.3

Compare and contrast setNames() with purrr::set_names().

The function setNames() takes two arguments, a vector to be named and a vector of names to apply to its elements.

You can use the values of the vector as its names if only the nm argument is supplied.

The function set_names() has more ways to set the names than setNames(). The names can be specified in the same manner as setNames().

The names can also be specified as unnamed arguments,

The function set_names() will name an object with itself if no nm argument is provided (the opposite of setNames() behavior).

The biggest difference between set_names() and setNames() is that set_names() allows for using a function or formula to transform the existing names.

The set_names() function also checks that the length of the names argument is the same length as the vector that is being named, and will raise an error if it is not.

The setNames() function will allow the names to be shorter than the vector being named, and will set the missing names to NA.
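
The differences described above can be seen side by side. This sketch assumes the purrr package is installed; the vectors are my own examples.

```r
library(purrr)

# Both functions accept a vector of names
base_named <- setNames(1:4, c("a", "b", "c", "d"))
purrr_named <- set_names(1:4, c("a", "b", "c", "d"))

# set_names() also accepts the names as separate arguments
dots_named <- set_names(1:4, "a", "b", "c", "d")

# With no names given, set_names() names the vector with itself
self_named <- set_names(c("a", "b", "c", "d"))

# A function or formula transforms the existing names
upper_fn <- set_names(c(a = 1, b = 2, c = 3), toupper)
upper_formula <- set_names(c(a = 1, b = 2, c = 3), ~ toupper(.))

# set_names() rejects names of the wrong length ...
length_error <- tryCatch(
  set_names(1:4, c("a", "b")),
  error = function(e) "error"
)
# ... while setNames() pads the missing names with NA
padded <- setNames(1:4, c("a", "b"))
names(padded)
```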

Exercise 20.4.4

Create functions that take a vector as input and returns:

  1. The last value. Should you use [ or [[?
  2. The elements at even numbered positions.
  3. Every element except the last value.
  4. Only even numbers (and no missing values).

The answers to the parts follow.

  1. This function finds the last value in a vector.

    The function uses [[ in order to extract a single element.

  2. This function returns the elements at even number positions.

  3. This function returns a vector with every element except the last.

    We should also confirm that the function works with some edge cases, like a vector with one element, and a vector with zero elements.

    In both these cases, not_last() correctly returns an empty vector.

  4. This function returns the elements of a vector that are even numbers.

    We could improve this function by handling the special numeric values: NA, NaN, Inf. However, first we need to decide how to handle them. Neither NaN nor Inf is a number, and so they are neither even nor odd. In other words, since NaN and Inf aren’t even numbers, they aren’t even numbers. What about NA? Well, we don’t know. NA is a number, but we don’t know its value. The missing number could be even or odd, but we don’t know. Another reason to return NA is that it is consistent with the behavior of other R functions, which generally return NA values instead of dropping them.
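
One possible set of functions for the four parts is sketched below. Apart from not_last(), which is named in the text, the function names are my own. The last function follows the reasoning above: it drops NaN and infinite values but keeps NA.

```r
# 1. Last value; [[ extracts exactly one element
last_value <- function(x) {
  if (length(x)) x[[length(x)]] else x
}

# 2. Elements at even-numbered positions
even_indices <- function(x) {
  x[seq_along(x) %% 2 == 0]
}

# 3. Every element except the last
not_last <- function(x) {
  n <- length(x)
  if (n) x[-n] else x
}

# 4. Even numbers; NaN and infinities are dropped, NA is kept
even_numbers <- function(x) {
  x[!is.infinite(x) & !is.nan(x) & (x %% 2 == 0)]
}

last_value(1:10)
even_indices(letters[1:5])
not_last(1:3)
not_last(1)
not_last(numeric())
even_numbers(c(0:4, NA, NaN, Inf, -Inf))
```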

Exercise 20.4.5

Why is x[-which(x > 0)] not the same as x[x <= 0]?

These expressions differ in the way that they treat missing values. Let’s test how they work by creating a vector with positive and negative integers, and special values (NA, NaN, and Inf). These values should encompass all relevant types of values that these expressions would encounter.

The expressions x[-which(x > 0)] and x[x <= 0] return the same values except for a NaN instead of an NA in the expression using which.

So what is going on here? Let’s work through each part of these expressions and see where the difference occurs. Let’s start with the expression x[x <= 0].

Recall how the logical relational operators (<, <=, ==, !=, >, >=) treat NA values. Any relational operation that includes a NA returns an NA. Is NA <= 0? We don’t know because it depends on the unknown value of NA, so the answer is NA. This same argument applies to NaN. Asking whether NaN <= 0 does not make sense because you can’t compare a number to “Not a Number”.

Now recall how indexing treats NA values. Indexing can take a logical vector as an input. When the indexing vector is logical, the output vector includes those elements where the logical vector is TRUE, and excludes those elements where the logical vector is FALSE. Logical vectors can also include NA values, and it is not clear how they should be treated. Well, since the value is NA, it could be TRUE or FALSE, we don’t know. Keeping elements with NA would treat the NA as TRUE, and dropping them would treat the NA as FALSE. R handles NA differently from both TRUE and FALSE: it includes the elements where the indexing vector is NA, but sets their values to NA.

Now consider the expression x[-which(x > 0)]. As before, to understand this expression we’ll work from the inside out. Consider x > 0.

As with x <= 0, it returns NA for comparisons involving NA and NaN.

What does which() do?

The which() function returns the indexes for which the argument is TRUE. This means that it is not including the indexes for which the argument is FALSE or NA.

Now consider the full expression x[-which(x > 0)]. The which() function returned a vector of integers. How does indexing treat negative integers?

If indexing gets a vector of positive integers, it will select those indexes; if it receives a vector of negative integers, it will drop those indexes. Thus, x[-which(x > 0)] ends up dropping the elements for which x > 0 is true, and keeps all the other elements and their original values, including NA and NaN.

There’s one other special case that we should consider. How do these two expressions work with an empty vector?

Thankfully, they both handle empty vectors the same.
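
The steps of this exercise can be reproduced with a small vector. The vector x is my own example, containing integers and the special values.

```r
x <- c(-1:1, Inf, -Inf, NaN, NA)

# Logical indexing: NA in the index produces NA in the result
by_logical <- x[x <= 0]
# which() keeps only TRUE positions, so NaN and NA survive unchanged
by_which <- x[-which(x > 0)]

by_logical
by_which

# Both expressions return an empty vector for empty input
empty <- numeric()
empty[empty <= 0]
empty[-which(empty > 0)]
```

The two results differ only in the fourth element, which is NaN when using which() and NA when using the logical index.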

This exercise is a reminder to always test your code. Even though these two expressions looked equivalent, they are not in practice. And when you do test code, consider both how it works on typical values as well as special values and edge cases, like a vector with NA or NaN or Inf values, or an empty vector. These are where unexpected behavior is most likely to occur.

Exercise 20.4.6

What happens when you subset with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?

Let’s consider the named vector,

If we subset it by an integer larger than its length, it returns a vector of missing values.

This also applies to ranges.

If some indexes are larger than the length of the vector, those elements are NA.

Likewise, when [ is provided names not in the vector’s names, it will return NA for those elements.

Though not yet discussed much in this chapter, the [[ behaves differently. With an atomic vector, if [[ is given an index outside the range of the vector or an invalid name, it raises an error.
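
These cases can be checked with a short example. The named vector below is my own choice; tryCatch() is used only so that the errors from [[ do not stop the script.

```r
x <- c(a = 10, b = 20)

# [ returns NA for positions or names that do not exist
x[3]
x[3:5]
x[1:5]
x["c"]
x[c("c", "d", "a")]

# [[ raises an error instead
bad_index <- tryCatch(x[[3]], error = function(e) "error")
bad_name <- tryCatch(x[["c"]], error = function(e) "error")
bad_index
bad_name
```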

20.5 Recursive vectors (lists)

Exercise 20.5.1

Draw the following lists as nested sets:

  1. list(a, b, list(c, d), list(e, f))
  2. list(list(list(list(list(list(a))))))

There are a variety of ways to draw these graphs. The original diagrams in R for Data Science were produced with Graffle. You could also use various diagramming, drawing, or presentation software, including Adobe Illustrator, Inkscape, PowerPoint, Keynote, and Google Slides.

For these examples, I generated these diagrams programmatically using the DiagrammeR R package to render Graphviz diagrams.

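  1. The nested set diagram for list(a, b, list(c, d), list(e, f)) [9] is an outer set that contains a, b, and two inner sets: one containing c and d, and one containing e and f.

  2. The nested set diagram for list(list(list(list(list(list(a)))))) is six sets nested one inside another, with a in the innermost set.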

Exercise 20.5.2

What happens if you subset a tibble as if you’re subsetting a list? What are the key differences between a list and a tibble?

Subsetting a tibble works the same way as a list; a data frame can be thought of as a list of columns. The key difference between a list and a tibble is that all the elements (columns) of a tibble must have the same length (number of rows). Lists can have vectors with different lengths as elements.
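
A short example shows the list-style subsetting operators applied to a tibble. This sketch assumes the tibble package is installed; the column names are arbitrary.

```r
library(tibble)

x <- tibble(a = 1:2, b = 3:4)

# [[ and $ extract a column as a vector, as with a list element
col_vec <- x[["a"]]
col_dollar <- x$a

# [ returns a smaller tibble, as [ on a list returns a smaller list
sub_name <- x["a"]
sub_pos <- x[1]

col_vec
sub_name
```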

20.6 Attributes

No exercises

20.7 Augmented vectors

Exercise 20.7.1

What does hms::hms(3600) return? How does it print? What primitive type is the augmented vector built on top of? What attributes does it use?

hms::hms() returns an object of class "hms" (which also inherits from "difftime"), and prints the time in “%H:%M:%S” format.

The primitive type is a double.

The attributes it uses are "units" and "class".
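
These answers can be checked by inspecting the object. This sketch assumes the hms package is installed.

```r
x <- hms::hms(3600)

print(x)         # prints as a time of day
class(x)         # class vector of the augmented vector
typeof(x)        # underlying primitive type
attributes(x)    # the attributes that augment the double
```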

Exercise 20.7.2

Try and make a tibble that has columns with different lengths. What happens?

If I try to create a tibble with a scalar and a column of a different length there are no issues, and the scalar is repeated to the length of the longer vector.

However, if I try to create a tibble with two vectors of different lengths (other than one), the tibble function throws an error.
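
Both cases can be tried directly. This sketch assumes the tibble package is installed; tryCatch() is used only to capture the error message instead of stopping.

```r
library(tibble)

# A length-one value is recycled to match the longer column
recycled <- tibble(x = 1, y = 1:5)
recycled

# Two columns with different lengths (neither of length one) fail
result <- tryCatch(
  tibble(x = 1:3, y = 1:4),
  error = function(e) conditionMessage(e)
)
result
```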

Exercise 20.7.3

Based on the definition above, is it OK to have a list as a column of a tibble?

If I didn’t already know the answer, what I would do is try it out. From the above, the error message was about vectors having different lengths. But there is nothing that prevents a tibble from having vectors of different types: doubles, character, integers, logical, factor, date. The latter are still atomic, but they have additional attributes. So, maybe there won’t be an issue with a list vector as long as it is the same length.

It works! I even used a list with heterogeneous types and there wasn’t an issue. In following chapters we’ll see that list vectors can be very useful: for example, when processing many different models.
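
A minimal version of that experiment is sketched below, assuming the tibble package is installed. The contents of the list column are my own examples.

```r
library(tibble)

# A list column whose elements have different types
df <- tibble(x = 1:3, y = list("a", 1, list(1:3)))
df

typeof(df$y)
```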


  8. See the documentation for .Machine$double.rounding.

  9. These diagrams were created with the DiagrammeR package.