20 Vectors

20.1 Introduction

library("tidyverse")

No exercises

20.3 Important types of atomic vector

Exercise 20.3.1

Describe the difference between is.finite(x) and !is.infinite(x).

To find out, try the functions on a numeric vector that includes at least one number and the four special values (NA, NaN, Inf, -Inf).

x <- c(0, NA, NaN, Inf, -Inf)
is.finite(x)
#> [1]  TRUE FALSE FALSE FALSE FALSE
!is.infinite(x)
#> [1]  TRUE  TRUE  TRUE FALSE FALSE

The is.finite() function considers non-missing numeric values to be finite, and considers missing (NA), not a number (NaN), positive infinity (Inf), and negative infinity (-Inf) to not be finite. The is.infinite() function behaves slightly differently. It considers Inf and -Inf to be infinite, and everything else, including non-missing numbers, NA, and NaN, to not be infinite. See Table 20.1.

Table 20.1: Results of is.finite() and is.infinite() for numeric and special values.

value  is.finite()  is.infinite()
1      TRUE         FALSE
NA     FALSE        FALSE
NaN    FALSE        FALSE
Inf    FALSE        TRUE
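
As a small sketch of where the two tests disagree, the following base R code builds the comparison programmatically and flags the rows where is.finite(x) and !is.infinite(x) give different answers. The data frame and its column names are illustrative choices, not part of the original exercise.

```r
# One ordinary number plus the four special values.
x <- c(1, NA, NaN, Inf, -Inf)

tab <- data.frame(
  value       = c("1", "NA", "NaN", "Inf", "-Inf"),
  is_finite   = is.finite(x),
  is_infinite = is.infinite(x),
  # TRUE where is.finite(x) and !is.infinite(x) disagree
  differ      = xor(is.finite(x), !is.infinite(x))
)
print(tab)

# Practical consequence: subsetting keeps different elements.
kept_finite       <- x[is.finite(x)]     # drops NA and NaN as well
kept_not_infinite <- x[!is.infinite(x)]  # keeps NA and NaN
```

The two tests disagree only for NA and NaN, so the choice matters mainly when missing values should survive a filter.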

Exercise 20.3.2

Read the source code for dplyr::near() (Hint: to see the source code, drop the ()). How does it work?

The source for dplyr::near is:

dplyr::near
#> function (x, y, tol = .Machine$double.eps^0.5)
#> {
#>     abs(x - y) < tol
#> }
#> <bytecode: 0x5d8e8a8>
#> <environment: namespace:dplyr>

Instead of checking for exact equality, it checks that two numbers are within a certain tolerance, tol. By default the tolerance is set to the square root of .Machine$double.eps, which is the smallest positive floating point number x such that 1 + x != 1 (it is not the smallest number the computer can represent).
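
To see why a tolerance is useful, here is a minimal sketch that re-implements the same comparison in base R so that it runs without dplyr. The name near_sketch is hypothetical; its body copies the source printed above.

```r
# Same logic as the printed source of dplyr::near(), under a made-up name.
near_sketch <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

# sqrt(2)^2 is not exactly 2 in floating point arithmetic.
exact    <- sqrt(2)^2 == 2            # FALSE
tolerant <- near_sketch(sqrt(2)^2, 2) # TRUE

# The default tolerance is roughly 1.5e-08.
default_tol <- .Machine$double.eps^0.5
```

The default tolerance is far larger than typical rounding error in a single operation, yet small enough that values differing in the first several significant digits are still treated as different.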

Exercise 20.3.3

A logical vector can take 3 possible values. How many possible values can an integer vector take? How many possible values can a double take? Use Google to do some research.

For integer vectors, R uses a 32-bit representation. This means that it can represent up to $$2^{32}$$ different values with integers. One of these values is set aside for NA_integer_. From the help for integer:

Note that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly.

The range of integer values that R can represent in an integer vector is $$\pm (2^{31} - 1)$$,

.Machine$integer.max
#> [1] 2147483647

The maximum integer is $$2^{31} - 1$$ rather than $$2^{32}$$ because 1 bit is used to represent the sign ($$+$$, $$-$$) and one value is used to represent NA_integer_. If you try to represent an integer greater than that value, R will return NA values.

.Machine$integer.max + 1L
#> Warning in .Machine$integer.max + 1L: NAs produced by integer overflow
#> [1] NA

However, you can represent that value (exactly) with a numeric vector at the cost of about two times the memory.

as.numeric(.Machine$integer.max) + 1
#> [1] 2.15e+09

The same is true for the negative of the integer max.

-.Machine$integer.max - 1L
#> Warning in -.Machine$integer.max - 1L: NAs produced by integer overflow
#> [1] NA
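
The counting argument above can be checked with a short base R sketch. The variable names are illustrative; the only facts assumed are the ones stated in the text (a 32-bit representation with one bit pattern reserved for NA_integer_).

```r
int_max <- .Machine$integer.max

# The largest integer is 2^31 - 1.
max_is_2_31_minus_1 <- int_max == 2^31 - 1

# The most negative representable integer is the negation of the maximum.
most_negative <- -int_max
type_of_negative <- typeof(most_negative)  # "integer"

# Non-missing values: (2^31 - 1) negatives, (2^31 - 1) positives, and zero.
n_non_missing <- 2 * (2^31 - 1) + 1        # equals 2^32 - 1

# Adding a double 1 (not 1L) promotes to double, so nothing overflows.
promoted <- int_max + 1
type_of_promoted <- typeof(promoted)       # "double"
```

Together with NA_integer_, the non-missing values account for all $$2^{32}$$ bit patterns.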

For double vectors, R uses a 64-bit representation. This means that they can hold up to $$2^{64}$$ values exactly. However, some of those values are allocated to special values such as -Inf, Inf, NA_real_, and NaN. From the help for double:

All R platforms are required to work with values conforming to the IEC 60559 (also known as IEEE 754) standard. This basically works with a precision of 53 bits, and represents to that precision a range of absolute values from about 2e-308 to 2e+308. It also has special values NaN (many of them), plus and minus infinity and plus and minus zero (although R acts as if these are the same). There are also denormal(ized) (or subnormal) numbers with absolute values above or below the range given above but represented to less precision.

The details of floating point representation and arithmetic are complicated, beyond the scope of this question, and better discussed in the references provided below. The double can represent numbers in the range of about $$\pm 2 \times 10^{308}$$, which is provided in

.Machine$double.xmax
#> [1] 1.8e+308

Many other details for the implementation of the double vectors are given in the .Machine variable (and its documentation). These include the base (radix) of doubles,

.Machine$double.base
#> [1] 2

the number of bits used for the significand (mantissa),

.Machine$double.digits
#> [1] 53

the number of bits used in the exponent,

.Machine$double.exponent
#> [1] 11

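and the machine epsilons: double.eps is the smallest positive number x such that 1 + x != 1, and double.neg.eps is the smallest positive number x such that 1 - x != 1. Neither is the smallest non-zero double, which is far smaller.

.Machine$double.eps
#> [1] 2.22e-16
.Machine$double.neg.eps
#> [1] 1.11e-16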
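
A brief sketch of what these two .Machine values measure, using only base R. They describe the spacing of doubles next to 1, which is different from the smallest positive double (.Machine$double.xmin gives the smallest positive normalized value). The variable names are illustrative.

```r
eps     <- .Machine$double.eps      # gap between 1 and the next larger double
neg_eps <- .Machine$double.neg.eps  # gap between 1 and the next smaller double

one_plus_eps_differs  <- (1 + eps) != 1      # TRUE
one_plus_half_is_one  <- (1 + eps / 2) == 1  # TRUE: the sum rounds back to 1
one_minus_neg_differs <- (1 - neg_eps) != 1  # TRUE

# Both are powers of two tied to the 53-bit significand.
eps_is_2_52     <- eps == 2^-52
neg_eps_is_2_53 <- neg_eps == 2^-53

# Far smaller positive doubles exist.
xmin_is_smaller <- .Machine$double.xmin < neg_eps
```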

Exercise 20.3.4

Brainstorm at least four functions that allow you to convert a double to an integer. How do they differ? Be precise.

The functions that convert a double to an integer differ in how they deal with the fractional part of the double. There are a variety of rules that could be used to do this.

• Round down, towards $$-\infty$$. This is also called taking the floor of a number. This is the method the floor() function uses.

• Round up, towards $$+\infty$$. This is also called taking the ceiling. This is the method the ceiling() function uses.

• Round towards zero. This is the method that the trunc() and as.integer() functions use.

• Round away from zero.

• Round to the nearest integer. There are several different methods for handling ties, defined as numbers with a fractional part of 0.5.

• Round half down, towards $$-\infty$$.
• Round half up, towards $$+\infty$$.
• Round half towards zero
• Round half away from zero
• Round half towards the even integer. This is the method that the round() function uses.
• Round half towards the odd integer.

# The name round_to_integer is introduced here so that the function can be called.
# case_when() comes from dplyr, which library("tidyverse") attaches.
round_to_integer <- function(x, method) {
  if (method == "round down") {
    floor(x)
  } else if (method == "round up") {
    ceiling(x)
  } else if (method == "round towards zero") {
    trunc(x)
  } else if (method == "round away from zero") {
    sign(x) * ceiling(abs(x))
  } else if (method == "nearest, round half up") {
    floor(x + 0.5)
  } else if (method == "nearest, round half down") {
    ceiling(x - 0.5)
  } else if (method == "nearest, round half towards zero") {
    sign(x) * ceiling(abs(x) - 0.5)
  } else if (method == "nearest, round half away from zero") {
    sign(x) * floor(abs(x) + 0.5)
  } else if (method == "nearest, round half to even") {
    round(x, digits = 0)
  } else if (method == "nearest, round half to odd") {
    case_when(
      # smaller integer is odd - round half down
      # (case_when() needs a logical condition, hence the == 1)
      floor(x) %% 2 == 1 ~ ceiling(x - 0.5),
      # otherwise, round half up
      TRUE ~ floor(x + 0.5)
    )
  } else if (method == "nearest, round half randomly") {
    # choose at random, per element, whether a tie goes up or down
    round_half_up <- sample(c(TRUE, FALSE), length(x), replace = TRUE)
    y <- x
    y[round_half_up] <- floor(x[round_half_up] + 0.5)
    y[!round_half_up] <- ceiling(x[!round_half_up] - 0.5)
    y
  }
}

tibble(
  x = c(1.8, 1.5, 1.2, 0.8, 0.5, 0.2,
        -0.2, -0.5, -0.8, -1.2, -1.5, -1.8),
  # column names containing spaces or commas must be quoted with backticks
  `Round down` = floor(x),
  `Round up` = ceiling(x),
  `Round towards zero` = trunc(x),
  `Nearest, round half to even` = round(x)
)
#> # A tibble: 12 x 5
#>       x Round down Round up Round towards zero Nearest, round half to ev…
#>   <dbl>        <dbl>      <dbl>                <dbl>                       <dbl>
#> 1   1.8            1          2                    1                           2
#> 2   1.5            1          2                    1                           2
#> 3   1.2            1          2                    1                           1
#> 4   0.8            0          1                    0                           1
#> 5   0.5            0          1                    0                           0
#> 6   0.2            0          1                    0                           0
#> # … with 6 more rows
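
The tie-breaking rules in the list above differ only on values with a fractional part of exactly 0.5. As a rough sketch, the following base R code applies five of them to ties on both sides of zero. The data frame and its column names are illustrative, and the formulas are the same ones used in the function above.

```r
# Ties on both sides of zero; each value is exactly representable as a double.
x <- c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)

ties <- data.frame(
  x              = x,
  half_up        = floor(x + 0.5),                   # ties towards +Inf
  half_down      = ceiling(x - 0.5),                 # ties towards -Inf
  half_to_zero   = sign(x) * ceiling(abs(x) - 0.5),  # ties towards zero
  half_from_zero = sign(x) * floor(abs(x) + 0.5),    # ties away from zero
  half_to_even   = round(x)                          # ties to the even integer
)
print(ties)
```

Only the last column sends half of the positive ties down and half up, which is the property that matters when many rounded values are summed.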

See the Wikipedia articles, Rounding and IEEE floating point for more discussion of these rounding rules.

For rounding, R and many programming languages use the IEEE standard. This method is called “round to nearest, ties to even.” This rule rounds ties, numbers with a remainder of 0.5, to the nearest even number. In this rule, half the ties are rounded up, and half are rounded down. The following function, round2(), is a manual implementation to compare against round(). As written, it does not use its to_even argument and always rounds ties up, towards $$+\infty$$, which is the behavior shown in the output.

x <- seq(-10, 10, by = 0.5)

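round2 <- function(x, to_even = TRUE) {
  # NOTE: to_even is accepted but never used, so ties always round towards +Inf.
  q <- x %/% 1  # integer part, rounded down
  r <- x %% 1   # fractional part, in [0, 1)
  q + (r >= 0.5)
}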
x <- c(-12.5, -11.5, 11.5, 12.5)
round(x)
#> [1] -12 -12  12  12
round2(x, to_even = FALSE)
#> [1] -12 -11  12  13
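
Because round2() as printed rounds every tie up, here is a hedged sketch of how a to_even switch could be honored. The function round_ties() is hypothetical (both its name and its logic are assumptions made for illustration, not the author's code). It reuses the same quotient and remainder split.

```r
# Hypothetical variant of round2() that actually uses `to_even`.
round_ties <- function(x, to_even = TRUE) {
  q <- x %/% 1  # integer part, rounded down
  r <- x %% 1   # fractional part, in [0, 1)
  # Round up when clearly above the midpoint. On an exact tie, round up
  # either always (to_even = FALSE) or only when q is odd, so that the
  # result q + 1 is even.
  round_up <- r > 0.5 | (r == 0.5 & (!to_even | q %% 2 == 1))
  q + round_up
}

x <- c(-12.5, -11.5, 11.5, 12.5)
even_result <- round_ties(x)                   # -12 -12  12  12, same as round(x)
up_result   <- round_ties(x, to_even = FALSE)  # -12 -11  12  13

# On a symmetric sequence of ties, only the ties-to-even rule keeps the sum at 0.
s <- seq(-100.5, 100.5, by = 1)
sum_even <- sum(round_ties(s))                   # 0
sum_up   <- sum(round_ties(s, to_even = FALSE))  # 101
```

The sketch relies on the inputs being exact ties. Values such as 0.15 that are not exactly representable in binary need more care, which is one reason to prefer the built-in round().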

This rounding method may be different than the one you learned in grade school, which, at least for me, was to always round ties upwards or, alternatively, away from zero. This rule is called the “round half up” rule. The problem with the “round half up” rule is that it is biased upwards for positive numbers. Rounding to nearest with ties towards even is not. Consider this sequence, which sums to zero.

x <- seq(-100.5, 100.5, by = 1)
x
#>   [1] -100.5  -99.5  -98.5  -97.5  -96.5  -95.5  -94.5  -93.5  -92.5  -91.5
#>  [11]  -90.5  -89.5  -88.5  -87.5  -86.5  -85.5  -84.5  -83.5  -82.5  -81.5
#>  [21]  -80.5  -79.5  -78.5  -77.5  -76.5  -75.5  -74.5  -73.5  -72.5  -71.5
#>  [31]  -70.5  -69.5  -68.5  -67.5  -66.5  -65.5  -64.5  -63.5  -62.5  -61.5
#>  [41]  -60.5  -59.5  -58.5  -57.5  -56.5  -55.5  -54.5  -53.5  -52.5  -51.5
#>  [51]  -50.5  -49.5  -48.5  -47.5  -46.5  -45.5  -44.5  -43.5  -42.5  -41.5
#>  [61]  -40.5  -39.5  -38.5  -37.5  -36.5  -35.5  -34.5  -33.5  -32.5  -31.5
#>  [71]  -30.5  -29.5  -28.5  -27.5  -26.5  -25.5  -24.5  -23.5  -22.5  -21.5
#>  [81]  -20.5  -19.5  -18.5  -17.5  -16.5  -15.5  -14.5  -13.5  -12.5  -11.5
#>  [91]  -10.5   -9.5   -8.5   -7.5   -6.5   -5.5   -4.5   -3.5   -2.5   -1.5
#> [101]   -0.5    0.5    1.5    2.5    3.5    4.5    5.5    6.5    7.5    8.5
#> [111]    9.5   10.5   11.5   12.5   13.5   14.5   15.5   16.5   17.5   18.5
#> [121]   19.5   20.5   21.5   22.5   23.5   24.5   25.5   26.5   27.5   28.5
#> [131]   29.5   30.5   31.5   32.5   33.5   34.5   35.5   36.5   37.5   38.5
#> [141]   39.5   40.5   41.5   42.5   43.5   44.5   45.5   46.5   47.5   48.5
#> [151]   49.5   50.5   51.5   52.5   53.5   54.5   55.5   56.5   57.5   58.5
#> [161]   59.5   60.5   61.5   62.5   63.5   64.5   65.5   66.5   67.5   68.5
#> [171]   69.5   70.5   71.5   72.5   73.5   74.5   75.5   76.5   77.5   78.5
#> [181]   79.5   80.5   81.5   82.5   83.5   84.5   85.5   86.5   87.5   88.5
#> [191]   89.5   90.5   91.5   92.5   93.5   94.5   95.5   96.5   97.5   98.5
#> [201]   99.5  100.5
sum(x)
#> [1] 0

A nice property for a rounding rule would be that it preserves that sum. Using “ties towards even”, the sum is still zero. However, “ties towards $$+\infty$$” produces a non-zero number.

sum(x)
#> [1] 0
sum(round(x))
#> [1] 0
sum(round2(x))
#> [1] 101

Rounding rules can have real-world impacts. One notable example was that in 1983, the Vancouver stock exchange adjusted its index from 524.811 to 1098.892 to correct for accumulated error due to rounding to three decimal places (see Vancouver Stock Exchange). This site lists several more examples of the dangers of rounding rules.

Exercise 20.3.5

What functions from the readr package allow you to turn a string into logical, integer, and double vector?

The function parse_logical() parses logical values, which can appear as variations of TRUE/FALSE or 1/0.

parse_logical(c("TRUE", "FALSE", "1", "0", "true", "t", "NA"))
#> [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE    NA

The function parse_integer() parses integer values.

parse_integer(c("1235", "0134", "NA"))
#> [1] 1235  134   NA

However, if there are any non-numeric characters in the string, including currency symbols, commas, and decimals, parse_integer() fails to parse those values: it returns NA for them and issues a warning.

parse_integer(c("1000", "$1,000", "10.00"))
#> Warning: 2 parsing failures.
#> row col               expected actual
#>   2  -- an integer             $1,000
#>   3  -- no trailing characters .00
#> [1] 1000   NA   NA
#> attr(,"problems")
#> # A tibble: 2 x 4
#>     row   col expected               actual
#>   <int> <int> <chr>                  <chr>
#> 1     2    NA an integer             $1,000
#> 2     3    NA no trailing characters .00

The function parse_number() parses numeric values. Unlike parse_integer(), the function parse_number() is more forgiving about the format of the numbers. It ignores all non-numeric characters before or after the first number, as with "$1,000.00" in the example. Within the number, parse_number() will only ignore grouping marks such as ",". This allows it to easily parse numeric fields that include currency symbols and comma separators in number strings without any intervention by the user.
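
As a short illustration of that forgiving behavior, the sketch below calls parse_number() on a few strings chosen for this example. It assumes the readr package is installed and uses its default locale, where "," is the grouping mark and "." is the decimal mark.

```r
library(readr)

# Currency symbol and grouping mark, a plain decimal, and digits
# surrounded by letters on one side.
inputs <- c("$1,000.00", "1.0", "1234ABC", "ABCD12234.90")
parsed <- parse_number(inputs)
parsed
```

In each case only the first number found in the string is kept: 1000, 1, 1234, and 12234.9.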

#> [1] "secs"
#>
#> $class
#> [1] "hms"      "difftime"

Exercise 20.7.2

Try and make a tibble that has columns with different lengths. What happens?

If I try to create a tibble with a scalar and column of a different length there are no issues, and the scalar is repeated to the length of the longer vector.

tibble(x = 1, y = 1:5)
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <int>
#> 1     1     1
#> 2     1     2
#> 3     1     3
#> 4     1     4
#> 5     1     5

However, if I try to create a tibble with two vectors of different lengths (other than one), the tibble function throws an error.

tibble(x = 1:3, y = 1:4)
#> Error: Tibble columns must have compatible sizes.
#> * Size 3: Existing data.
#> * Size 4: Column y.
#> ℹ Only values of size one are recycled.

Exercise 20.7.3

Based on the definition above, is it OK to have a list as a column of a tibble?

If I didn’t already know the answer, what I would do is try it out. From the above, the error message was about vectors having different lengths. But there is nothing that prevents a tibble from having vectors of different types: doubles, character, integers, logical, factor, date. The latter are still atomic, but they have additional attributes. So, maybe there won’t be an issue with a list vector as long as it is the same length.

tibble(x = 1:3, y = list("a", 1, list(1:3)))
#> # A tibble: 3 x 2
#>       x y
#>   <int> <list>
#> 1     1 <chr [1]>
#> 2     2 <dbl [1]>
#> 3     3 <list [1]>

It works! I even used a list with heterogeneous types and there wasn’t an issue. In following chapters we’ll see that list vectors can be very useful: for example, when processing many different models.

1. See the documentation for .Machine$double.rounding.

2. These diagrams were created with the DiagrammeR package.