4 Prediction
Prerequisites
library("tidyverse")
library("lubridate")
library("stringr")
library("forcats")
The packages modelr and broom are used to wrangle the results of linear regressions:
library("broom")
library("modelr")
4.1 Predicting Election Outcomes
4.1.1 Loops in R
RStudio provides many features to help with debugging, which will be useful in for loops and functions: see this article for an example.
values <- c(2, 3, 6)
n <- length(values)
results <- rep(NA, n)
for (i in 1:n) {
results[[i]] <- values[[i]] * 2
cat(values[[i]], "times 2 is equal to", results[[i]], "\n")
}
#> 2 times 2 is equal to 4
#> 3 times 2 is equal to 6
#> 6 times 2 is equal to 12
Note that the above code uses the for loop for pedagogical purposes only; it could simply have been written as
results <- values * 2
results
#> [1] 4 6 12
In general, avoid using for loops when there is a vectorized function.
But sticking with the for loop, there are several things that could be improved.
Avoid using the idiom 1:n in for loops. To see why, look at what happens when values is empty:
values <- c()
n <- length(values)
results <- rep(NA, n)
for (i in 1:n) {
cat("i = ", i, "\n")
results[i] <- values[i] * 2
cat(values[i], "times 2 is equal to", results[i], "\n")
}
#> i = 1
#> Error in results[i] <- values[i] * 2: replacement has length zero
Instead of not running the loop, as you might expect, it attempts two iterations, with i = 1 and then i = 0, because 1:0 counts down. This edge case occurs more often than you may think, especially if you are writing functions where you don't know the length of the vector ex ante.
The way to avoid this is to use either seq_len(n) or seq_along(values), which handle 0-length vectors correctly.
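A quick comparison shows why: 1:0 counts down, while the seq functions return a length-zero vector for empty input, so the loop body never runs.
1:0
#> [1] 1 0
seq_len(0)
#> integer(0)
seq_along(c())
#> integer(0)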
values <- c()
n <- length(values)
results <- rep(NA, n)
for (i in seq_along(values)) {
results[i] <- values[i] * 2
}
print(results)
#> logical(0)
or
values <- c()
n <- length(values)
results <- rep(NA, n)
for (i in seq_len(n)) {
results[i] <- values[i] * 2
}
print(results)
#> logical(0)
Also, note that the result is logical(0)
. That’s because the NA
missing value has class logical, and thus rep(NA, ...)
returns a logical vector. It is better style to initialize the vector with the same data type that you will be using,
results <- rep(NA_real_, length(values))
results
#> numeric(0)
class(results)
#> [1] "numeric"
Often loops can be rewritten to use a map function. Read the R for Data Science chapter Iteration before proceeding.
To do so, we first write a function that will be applied to each element of the vector. When converting from a for
loop to a function, this is usually simply the body of the for
loop, though you may need to add arguments for any variables defined outside the body of the for loop. In this case,
mult_by_two <- function(x) {
x * 2
}
We can now test that this function works on different values:
mult_by_two(0)
#> [1] 0
mult_by_two(2.5)
#> [1] 5
mult_by_two(-3)
#> [1] -6
At this point, we could replace the body of the for
loop with this function:
values <- c(2, 4, 6)
n <- length(values)
results <- rep(NA, n)
for (i in seq_len(n)) {
results[i] <- mult_by_two(values[i])
}
print(results)
#> [1] 4 8 12
This can be useful if the body of a for loop is many lines long.
However, this loop is still unwieldy code. We have to remember to define an empty vector results
that is the same size as values
to hold the results, and then correctly loop over all the values. We already saw how these steps leave room for errors. Functionals like map apply a function to each element of a vector.
results <- map(values, mult_by_two)
results
#> [[1]]
#> [1] 4
#>
#> [[2]]
#> [1] 8
#>
#> [[3]]
#> [1] 12
The values of each element are correct, but map returns a list, not the numeric vector we may have been expecting. If we want a numeric vector, use map_dbl,
results <- map_dbl(values, mult_by_two)
results
#> [1] 4 8 12
Also, instead of explicitly defining a function like mult_by_two, we could have used an anonymous function with the functional. An anonymous function is a function that is not assigned to a name.
results <- map_dbl(values, function(x) x * 2)
results
#> [1] 4 8 12
The various purrr functions will also interpret one-sided formulas as functions, in which .x and .y refer to (up to) two arguments.
results <- map_dbl(values, ~ .x * 2)
results
#> [1] 4 8 12
This is for parsimony and convenience; in the background, these functions are creating anonymous functions from the given formula.
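When a function of two inputs is needed, map2_dbl makes both arguments available as .x and .y. A small sketch (the second vector here is made up for illustration):
map2_dbl(values, c(10, 20, 30), ~ .x * 2 + .y)
#> [1] 14 28 42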
QSS discusses several debugging strategies. The functional approach lends itself to easier debugging because the function can be tested with input values independently of the loop.
4.1.2 General Conditional Statements in R
See the R for Data Science section Conditional Execution for a more complete discussion of conditional execution.
If you are using conditional statements to assign values within a data frame, see the dplyr functions if_else, recode, and case_when.
The following code, which uses a for loop,
values <- 1:5
n <- length(values)
results <- rep(NA_real_, n)
for (i in seq_len(n)) {
x <- values[i]
r <- x %% 2
if (r == 0) {
cat(x, "is even and I will perform addition", x, " + ", x, "\n")
results[i] <- x + x
} else {
cat(x, "is even and I will perform multiplication", x, " * ", x, "\n")
results[i] <- x * x
}
}
#> 1 is odd and I will perform multiplication 1 * 1
#> 2 is even and I will perform addition 2 + 2
#> 3 is odd and I will perform multiplication 3 * 3
#> 4 is even and I will perform addition 4 + 4
#> 5 is odd and I will perform multiplication 5 * 5
results
#> [1] 1 4 9 8 25
could be rewritten to use if_else
,
if_else(values %% 2 == 0, values + values, values * values)
#> [1] 1 4 9 8 25
or using the map_dbl
functional with a named function,
myfunc <- function(x) {
if (x %% 2 == 0) {
x + x
} else {
x * x
}
}
map_dbl(values, myfunc)
#> [1] 1 4 9 8 25
or map_dbl
with an anonymous function,
map_dbl(values, function(x) {
if (x %% 2 == 0) {
x + x
} else {
x * x
}
})
#> [1] 1 4 9 8 25
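The recode function mentioned at the start of this subsection handles the simpler case of mapping specific old values to new ones. A quick sketch with made-up data:
party <- c("D", "R", "D", "I")
recode(party, D = "Democrat", R = "Republican", .default = "Other")
#> [1] "Democrat"   "Republican" "Democrat"   "Other"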
4.1.3 Poll Predictions
Load the election polls by state for the 2008 US Presidential election,
data("polls08", package = "qss")
glimpse(polls08)
#> Observations: 1,332
#> Variables: 5
#> $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",...
#> $ Pollster <chr> "SurveyUSA-2", "Capital Survey-2", "SurveyUSA-2", "Ca...
#> $ Obama <int> 36, 34, 35, 35, 39, 34, 36, 25, 35, 34, 37, 36, 36, 3...
#> $ McCain <int> 61, 54, 62, 55, 60, 64, 58, 52, 55, 47, 55, 51, 49, 5...
#> $ middate <date> 2008-10-27, 2008-10-15, 2008-10-08, 2008-10-06, 2008...
and the election results,
data("pres08", package = "qss")
glimpse(pres08)
#> Observations: 51
#> Variables: 5
#> $ state.name <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo...
#> $ state <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE...
#> $ Obama <int> 39, 38, 45, 39, 61, 54, 61, 92, 62, 51, 47, 72, 36,...
#> $ McCain <int> 60, 59, 54, 59, 37, 45, 38, 7, 37, 48, 52, 27, 62, ...
#> $ EV <int> 9, 3, 10, 6, 55, 9, 7, 3, 3, 27, 15, 4, 4, 21, 11, ...
Compute Obama's margin in the polls and in the final election results:
polls08 <-
polls08 %>% mutate(margin = Obama - McCain)
pres08 <-
pres08 %>% mutate(margin = Obama - McCain)
To work with dates, the R package lubridate makes wrangling them much easier. See the R for Data Science chapter Dates and Times.
The function ymd converts character strings written in year-month-day order into dates. See dmy, mdy, and others for other orderings of the date components.
x <- ymd("2008-11-04")
y <- ymd("2008/9/1")
x - y
#> Time difference of 64 days
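For example, mdy parses dates written in month-day-year order:
mdy("11/4/2008")
#> [1] "2008-11-04"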
However, note that in polls08
, the date middate
is already a date
object,
class(polls08$middate)
#> [1] "Date"
The function read_csv by default will check character vectors to see if they have patterns that appear to be dates, and if so, will parse those columns as dates.
We’ll create a variable for election day
ELECTION_DAY <- ymd("2008-11-04")
and add a new column to polls08 with the days to the election:
polls08 <- mutate(polls08, ELECTION_DAY - middate)
Although the code in the chapter uses a for
loop, there is no reason to do so. We can accomplish the same task by merging the election results data to the polling data by state
.
polls_w_results <- left_join(polls08,
select(pres08, state, elec_margin = margin),
by = "state") %>%
mutate(error = elec_margin - margin)
glimpse(polls_w_results)
#> Observations: 1,332
#> Variables: 9
#> $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "...
#> $ Pollster <chr> "SurveyUSA-2", "Capital Survey-2", "S...
#> $ Obama <int> 36, 34, 35, 35, 39, 34, 36, 25, 35, 3...
#> $ McCain <int> 61, 54, 62, 55, 60, 64, 58, 52, 55, 4...
#> $ middate <date> 2008-10-27, 2008-10-15, 2008-10-08, ...
#> $ margin <int> -25, -20, -27, -20, -21, -30, -22, -2...
#> $ `ELECTION_DAY - middate` <time> 8 days, 20 days, 27 days, 29 days, 4...
#> $ elec_margin <int> -21, -21, -21, -21, -21, -21, -21, -2...
#> $ error <int> 4, -1, 6, -1, 0, 9, 1, 6, -1, -8, -3,...
To get the last poll in each state, arrange by state and descending middate, then take the first row in each group:
last_polls <-
polls_w_results %>%
arrange(state, desc(middate)) %>%
group_by(state) %>%
slice(1)
last_polls
#> # A tibble: 51 x 9
#> # Groups: state [51]
#> state Pollster Obama McCain middate margin `ELECTION_DAY - midd…
#> <chr> <chr> <int> <int> <date> <int> <time>
#> 1 AK Research 200… 39 58 2008-10-29 -19 6
#> 2 AL SurveyUSA-2 36 61 2008-10-27 -25 8
#> 3 AR ARG-4 44 51 2008-10-29 - 7 6
#> 4 AZ ARG-3 46 50 2008-10-29 - 4 6
#> 5 CA SurveyUSA-3 60 36 2008-10-30 24 5
#> 6 CO ARG-3 52 45 2008-10-29 7 6
#> # ... with 45 more rows, and 2 more variables: elec_margin <int>,
#> # error <int>
Challenge: Instead of using the last poll, what happens if you use the average of polls from the last week? The last month? How do the poll margins change over the course of the election period?
To simplify things for later, let’s define a function rmse
which calculates the root mean squared error, as defined in the book. See the R for Data Science chapter Functions for more on writing functions.
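In notation, the function below computes \[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (\text{actual}_i - \text{pred}_i)^2} . \]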
rmse <- function(actual, pred) {
sqrt(mean( (actual - pred) ^ 2))
}
Now we can use rmse()
to calculate the RMSE for all the final polls:
rmse(last_polls$margin, last_polls$elec_margin)
#> [1] 5.88
Or since we already have a variable error
,
sqrt(mean(last_polls$error ^ 2))
#> [1] 5.88
The mean prediction error is
mean(last_polls$error)
#> [1] 1.08
This is slightly different from the value in the book because of which poll is treated as each state's final poll; many states have multiple polls on the last day.
I’ll choose bin widths of 1%, since that is fairly interpretable:
ggplot(last_polls, aes(x = error)) +
geom_histogram(binwidth = 1, boundary = 0)
The text uses bin widths of 5%:
ggplot(last_polls, aes(x = error)) +
geom_histogram(binwidth = 5, boundary = 0)
Challenge: What other ways could you visualize the results? How would you show all states? What about plotting the absolute or squared errors instead of the errors?
Challenge: What happens to prediction error if you average polls? Consider averaging back over time? What happens if you take the averages of the state poll average and average of all polls - does that improve prediction?
To create a scatter plot that uses the state abbreviations instead of points, use geom_text instead of geom_point.
ggplot(last_polls, aes(x = margin, y = elec_margin, label = state)) +
geom_abline(color = "white", size = 2) +
geom_hline(yintercept = 0, color = "gray", size = 2) +
geom_vline(xintercept = 0, color = "gray", size = 2) +
geom_text() +
coord_fixed() +
labs(x = "Poll Results", y = "Actual Election Results")
We can create a confusion matrix as follows. Create a new column classification which shows how the poll's prediction was related to the actual election outcome ("true positive", "false positive", "true negative", "false negative"). If there were only two outcomes, we could use the if_else function. But with more than two outcomes, it is easier to use the dplyr function case_when.
last_polls <-
last_polls %>%
ungroup() %>%
mutate(classification =
case_when(
(.$margin > 0 & .$elec_margin > 0) ~ "true positive",
(.$margin > 0 & .$elec_margin < 0) ~ "false positive",
(.$margin < 0 & .$elec_margin < 0) ~ "true negative",
(.$margin < 0 & .$elec_margin > 0) ~ "false negative"
))
You need to use . to refer to the data frame when using case_when() within mutate(). Also, we needed to first use ungroup() in order to remove the grouping variable so that mutate() will work.
Now simply count the number of polls in each category of classification
:
last_polls %>%
group_by(classification) %>%
count()
#> # A tibble: 4 x 2
#> # Groups: classification [4]
#> classification n
#> <chr> <int>
#> 1 false negative 2
#> 2 false positive 1
#> 3 true negative 21
#> 4 true positive 27
Which states were incorrectly predicted by the polls, and what were their margins?
last_polls %>%
filter(classification %in% c("false positive", "false negative")) %>%
select(state, margin, elec_margin, classification) %>%
arrange(desc(elec_margin))
#> # A tibble: 3 x 4
#> state margin elec_margin classification
#> <chr> <int> <int> <chr>
#> 1 IN -5 1 false negative
#> 2 NC -1 1 false negative
#> 3 MO 1 -1 false positive
What was the difference between the poll prediction of electoral votes and the actual electoral votes? We didn't include the variable EV when we first merged, but that's no problem; we'll just merge again to grab that variable:
last_polls %>%
left_join(select(pres08, state, EV), by = "state") %>%
summarise(EV_pred = sum( (margin > 0) * EV),
EV_actual = sum( (elec_margin > 0) * EV))
#> # A tibble: 1 x 2
#> EV_pred EV_actual
#> <int> <int>
#> 1 349 364
data("pollsUS08", package = "qss")
pollsUS08 <- mutate(pollsUS08, DaysToElection = ELECTION_DAY - middate)
We'll produce the seven-day averages slightly differently from the method used in the text: for all dates in the data, we'll calculate the moving average. The code presented in QSS uses a for loop similar to the following:
all_dates <- seq(min(polls08$middate), ELECTION_DAY, by = "days")
# Number of poll days to use
POLL_DAYS <- 7
pop_vote_avg <- vector(length(all_dates), mode = "list")
for (i in seq_along(all_dates)) {
date <- all_dates[i]
# summarise the seven day
week_data <-
filter(polls08,
as.integer(middate - date) <= 0,
as.integer(middate - date) > - POLL_DAYS) %>%
summarise(Obama = mean(Obama, na.rm = TRUE),
McCain = mean(McCain, na.rm = TRUE))
# add date for the observation
week_data$date <- date
pop_vote_avg[[i]] <- week_data
}
pop_vote_avg <- bind_rows(pop_vote_avg)
Write a function that takes a date and calculates the moving average over the preceding days (an argument with a default of 7) using the dataset .data:
poll_ma <- function(date, .data, days = 7) {
filter(.data,
as.integer(middate - date) <= 0,
as.integer(middate - date) > - !!days) %>%
summarise(Obama = mean(Obama, na.rm = TRUE),
McCain = mean(McCain, na.rm = TRUE)) %>%
mutate(date = !!date)
}
The code above uses !!
. This tells filter
that days
refers to a variable days
in the calling environment, and not a column named days
in the data frame. In this case, there wouldn’t be any ambiguities since there is not a column named days
, but in general there can be ambiguities in the dplyr functions as to whether the names refer to columns in the data frame or variables in the environment calling the function. Read Programming with dplyr for an in-depth discussion of this.
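Here is a minimal, hypothetical illustration of the ambiguity that !! resolves; both the data frame df and the variable days below are made up for this sketch:
df <- tibble(days = 1:10)
days <- 3
# Without !!, `days` on the right-hand side is the column itself, so every
# row is compared with its own value and no rows are returned:
filter(df, days < days)
# With !!, the right-hand `days` is forced to be the variable from the
# calling environment (3), so the rows with days of 1 or 2 are returned:
filter(df, days < !!days)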
Calling it for November 1, 2008 returns a one-row data frame with the moving averages for Obama and McCain:
poll_ma(as.Date("2008-11-01"), polls08)
#> Obama McCain date
#> 1 49.1 45.4 2008-11-01
Since we made days an argument to the function, we can easily calculate other moving averages,
poll_ma(as.Date("2008-11-01"), polls08, days = 3)
#> Obama McCain date
#> 1 50.6 45.4 2008-11-01
Now use a functional to execute that function with all dates for which we want moving averages. The function poll_ma
returns a data frame, and our ideal output is a data frame that stacks those data frames row-wise. So we will use the map_df
function,
map_df(all_dates, poll_ma, polls08)
Note that the other arguments for poll_ma
are placed after the name of the function as additional arguments to map_df
.
It is easier to plot this if the data are tidy, with Obama
and McCain
as categories of a column candidate
.
pop_vote_avg_tidy <-
pop_vote_avg %>%
gather(candidate, share, -date, na.rm = TRUE)
head(pop_vote_avg_tidy)
#> date candidate share
#> 1 2008-01-01 Obama 46.5
#> 2 2008-01-02 Obama 46.5
#> 3 2008-01-03 Obama 46.5
#> 4 2008-01-04 Obama 46.5
#> 5 2008-01-05 Obama 46.5
#> 6 2008-01-06 Obama 46.5
ggplot(pop_vote_avg_tidy, aes(x = date, y = share,
colour = fct_reorder2(candidate, date, share))) +
geom_point() +
geom_line() +
scale_colour_manual("Candidate",
values = c(Obama = "blue", McCain = "red"))
Challenge: Read the R for Data Science chapter Iteration and use the function map_df to create the object pop_vote_avg as above instead of using a for loop.
The 7-day average is similar to the simple method used by Real Clear Politics. The RCP average is simply the average of all polls in their data over the last seven days. Sites like 538 and the Huffpost Pollster, on the other hand, also average polls, but they use more sophisticated statistical methods to assign different weights to different polls.
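The poll_ma function defined above can compute this sort of average directly. A quick sketch, assuming the national polls in pollsUS08 have the same Obama, McCain, and middate columns as the state polls:
poll_ma(ELECTION_DAY, pollsUS08)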
Challenge: Why do we need to use different polls for the popular vote data? Why not simply average all the state polls? What would you have to do? Would the overall popular vote be useful in predicting state-level polling, or vice-versa? How would you use them?
4.2 Linear Regression
4.2.1 Facial Appearance and Election Outcomes
Load the face
dataset:
data("face", package = "qss")
Add Democrat and Republican vote shares, and the difference in shares:
face <- mutate(face,
d.share = d.votes / (d.votes + r.votes),
r.share = r.votes / (d.votes + r.votes),
diff.share = d.share - r.share)
Plot facial competence vs. vote share:
ggplot(face, aes(x = d.comp, y = diff.share, colour = w.party)) +
geom_ref_line(h = 0) +
geom_point() +
scale_colour_manual("Winning\nParty",
values = c(D = "blue", R = "red")) +
labs(x = "Competence scores for Democrats",
y = "Democratic margin in vote share")
4.2.2 Correlation and Scatter Plots
cor(face$d.comp, face$diff.share)
#> [1] 0.433
4.2.3 Least Squares
Run the linear regression
fit <- lm(diff.share ~ d.comp, data = face)
fit
#>
#> Call:
#> lm(formula = diff.share ~ d.comp, data = face)
#>
#> Coefficients:
#> (Intercept) d.comp
#> -0.312 0.660
There are many functions to get data out of the lm
model.
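For example, base R provides extractor functions for lm objects (a few are sketched below; this output is not shown in the text):
coef(fit)                  # named vector of estimated coefficients
head(fitted(fit))          # fitted values
head(residuals(fit))       # residuals
summary(fit)$coefficients  # coefficient table as a matrix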
In addition to these, the broom package provides three functions: glance
, tidy
, and augment
that always return data frames.
The function glance returns a one-row data-frame summary of the model,
glance(fit)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> 1 0.187 0.18 0.266 27 8.85e-07 2 -10.5 27 35.3
#> deviance df.residual
#> 1 8.31 117
The function tidy returns a data frame in which each row is a coefficient,
tidy(fit)
#> term estimate std.error statistic p.value
#> 1 (Intercept) -0.312 0.066 -4.73 6.24e-06
#> 2 d.comp 0.660 0.127 5.19 8.85e-07
The function augment returns the original data with fitted values, residuals, and other observation level stats from the model appended to it.
augment(fit) %>% head()
#> diff.share d.comp .fitted .se.fit .resid .hat .sigma .cooksd
#> 1 0.2101 0.565 0.0606 0.0266 0.1495 0.00996 0.267 0.001600
#> 2 0.1194 0.342 -0.0864 0.0302 0.2059 0.01286 0.267 0.003938
#> 3 0.0499 0.612 0.0922 0.0295 -0.0423 0.01229 0.268 0.000158
#> 4 0.1965 0.542 0.0454 0.0256 0.1511 0.00922 0.267 0.001510
#> 5 0.4958 0.680 0.1370 0.0351 0.3588 0.01737 0.266 0.016307
#> 6 -0.3495 0.321 -0.1006 0.0319 -0.2490 0.01433 0.267 0.006436
#> .std.resid
#> 1 0.564
#> 2 0.778
#> 3 -0.160
#> 4 0.570
#> 5 1.358
#> 6 -0.941
We can plot the results of the bivariate linear regression as follows:
ggplot() +
geom_point(data = face, mapping = aes(x = d.comp, y = diff.share)) +
geom_ref_line(v = mean(face$d.comp)) +
geom_ref_line(h = mean(face$diff.share)) +
geom_abline(slope = coef(fit)["d.comp"],
intercept = coef(fit)["(Intercept)"],
colour = "red") +
annotate("text", x = 0.9, y = mean(face$diff.share) + 0.05,
label = "Mean of Y", color = "blue", vjust = 0) +
annotate("text", y = -0.9, x = mean(face$d.comp), label = "Mean of X",
color = "blue", hjust = 0) +
scale_y_continuous("Democratic margin in vote shares",
breaks = seq(-1, 1, by = 0.5), limits = c(-1, 1)) +
scale_x_continuous("Democratic margin in vote shares",
breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) +
ggtitle("Facial compotence and vote share")
A more general way to plot the predictions of the model against the data is to use the methods described in Ch 23.3.3 of R4DS. Create an evenly spaced grid of values of d.comp
, and add predictions of the model to it.
grid <- face %>%
data_grid(d.comp) %>%
add_predictions(fit)
head(grid)
#> # A tibble: 6 x 2
#> d.comp pred
#> <dbl> <dbl>
#> 1 0.0640 -0.270
#> 2 0.0847 -0.256
#> 3 0.0893 -0.253
#> 4 0.115 -0.237
#> 5 0.115 -0.236
#> 6 0.164 -0.204
Now we can plot the regression line and the original data just like any other plot.
ggplot() +
geom_point(data = face, mapping = aes(x = d.comp, y = diff.share)) +
geom_line(data = grid, mapping = aes(x = d.comp, y = pred),
colour = "red")
This method is more complicated than the geom_abline
method for a bivariate regression, but will work for more complicated models, while the geom_abline
method won’t.
Note that geom_smooth can be used to add a regression line to a plot.
ggplot(data = face, mapping = aes(x = d.comp, y = diff.share)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
The argument method = "lm"
specifies that the function lm
is to be used to generate fitted values. It is equivalent to running the regression lm(y ~ x)
and plotting the regression line, where y
and x
are the aesthetics specified by the mappings. The argument se = FALSE
tells the function not to plot the confidence interval of the regression (discussed later).
4.2.4 Regression towards the mean
4.2.5 Merging Data Sets in R
See the R for Data Science chapter Relational data.
data("pres12", package = "qss")
To join the two data frames:
full_join(pres08, pres12, by = "state")
However, since there are duplicate names, .x
and .y
are appended.
Challenge: What would happen if by = "state"
was dropped?
To avoid the duplicate names, or change them, you can rename before merging,
full_join(select(pres08, state, Obama_08 = Obama, McCain_08 = McCain,
EV_08 = EV),
select(pres12, state, Obama_12 = Obama, Romney_12 = Romney,
EV_12 = EV),
by = "state")
or use the suffix
argument to full_join
pres <- full_join(pres08, pres12, by = "state", suffix = c("_08", "_12"))
head(pres)
#> state.name state Obama_08 McCain EV_08 margin Obama_12 Romney EV_12
#> 1 Alabama AL 39 60 9 -21 38 61 9
#> 2 Alaska AK 38 59 3 -21 41 55 3
#> 3 Arizona AZ 45 54 10 -9 45 54 11
#> 4 Arkansas AR 39 59 6 -20 37 61 6
#> 5 California CA 61 37 55 24 60 37 55
#> 6 Colorado CO 54 45 9 9 51 46 9
Challenge: Would you consider this data tidy? How would you make it tidy?
The dplyr equivalent of cbind is bind_cols, which concatenates data frames by column.
pres <- pres %>%
mutate(Obama2008_z = as.numeric(scale(Obama_08)),
Obama2012_z = as.numeric(scale(Obama_12)))
Likewise, the dplyr equivalent of rbind is bind_rows, which concatenates data frames by row.
We need to use the as.numeric
function because scale()
always returns a matrix. Omitting as.numeric()
would not produce an error in the code chunk above, since the columns of a data frame can be matrices, but it would produce errors in some of the following code if it were omitted.
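As a small illustration of the binding functions mentioned above (the data frames here are made up):
df_a <- tibble(x = 1:2)
df_b <- tibble(y = 3:4)
bind_cols(df_a, df_b)  # two rows, columns x and y, like cbind()
bind_rows(df_a, df_a)  # four rows, column x, like rbind()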
Scatter plot of states with vote shares in 2008 and 2012
ggplot(pres, aes(x = Obama2008_z, y = Obama2012_z, label = state)) +
geom_abline(colour = "white", size = 2) +
geom_text() +
coord_fixed() +
scale_x_continuous("Obama's standardized vote share in 2008",
limits = c(-4, 4)) +
scale_y_continuous("Obama's standardized vote share in 2012",
limits = c(-4, 4))
To calculate the bottom and top quartiles
pres %>%
filter(Obama2008_z < quantile(Obama2008_z, 0.25)) %>%
summarise(improve = mean(Obama2012_z > Obama2008_z))
#> improve
#> 1 0.583
pres %>%
filter(Obama2008_z > quantile(Obama2008_z, 0.75)) %>%
summarise(improve = mean(Obama2012_z > Obama2008_z))
Challenge: Why is it important to standardize the vote shares?
4.2.6 Model Fit
data("florida", package = "qss")
fit2 <- lm(Buchanan00 ~ Perot96, data = florida)
fit2
#>
#> Call:
#> lm(formula = Buchanan00 ~ Perot96, data = florida)
#>
#> Coefficients:
#> (Intercept) Perot96
#> 1.3458 0.0359
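One summary of model fit is the coefficient of determination, \(R^2\), the proportion of the total variation in the outcome that the model explains: \[ R^2 = 1 - \frac{\sum_{i = 1}^{n} \hat{\epsilon}_i^2}{\sum_{i = 1}^{n} (Y_i - \bar{Y})^2} . \]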
Extract \(R ^ 2\) from the results of summary
,
summary(fit2)$r.squared
#> [1] 0.513
Alternatively, we can get the \(R^2\) value from the data frame that glance returns:
glance(fit2)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> 1 0.513 0.506 316 68.5 9.47e-12 2 -480 966 972
#> deviance df.residual
#> 1 6506118 65
We can add predictions and residuals to the original data frame using the modelr functions add_residuals and add_predictions
florida <-
florida %>%
add_predictions(fit2) %>%
add_residuals(fit2)
glimpse(florida)
#> Observations: 67
#> Variables: 9
#> $ county <chr> "Alachua", "Baker", "Bay", "Bradford", "Brevard", "...
#> $ Clinton96 <int> 40144, 2273, 17020, 3356, 80416, 320736, 1794, 2712...
#> $ Dole96 <int> 25303, 3684, 28290, 4038, 87980, 142834, 1717, 2783...
#> $ Perot96 <int> 8072, 667, 5922, 819, 25249, 38964, 630, 7783, 7244...
#> $ Bush00 <int> 34124, 5610, 38637, 5414, 115185, 177323, 2873, 354...
#> $ Gore00 <int> 47365, 2392, 18850, 3075, 97318, 386561, 2155, 2964...
#> $ Buchanan00 <int> 263, 73, 248, 65, 570, 788, 90, 182, 270, 186, 122,...
#> $ pred <dbl> 291.3, 25.3, 214.0, 30.8, 908.2, 1400.7, 24.0, 280....
#> $ resid <dbl> -2.83e+01, 4.77e+01, 3.40e+01, 3.42e+01, -3.38e+02,...
There are now two new columns in florida
, pred
with the fitted values (predictions), and resid
with the residuals.
Use these new columns to create a residual plot:
fit2_resid_plot <-
ggplot(florida, aes(x = pred, y = resid)) +
geom_ref_line(h = 0) +
geom_point() +
labs(x = "Fitted values", y = "residuals")
fit2_resid_plot
Note that we use the function geom_ref_line to add a reference line at 0.
Let's add labels to the points; who is that outlier?
fit2_resid_plot +
geom_label(aes(label = county))
The outlier county is “Palm Beach”
florida %>%
arrange(desc(abs(resid))) %>%
select(county, resid) %>%
head()
#> county resid
#> 1 PalmBeach 2302
#> 2 Broward -613
#> 3 Lee -357
#> 4 Brevard -338
#> 5 Miami-Dade -329
#> 6 Pinellas -317
Data without Palm Beach
florida_pb <- filter(florida, county != "PalmBeach")
fit3 <- lm(Buchanan00 ~ Perot96, data = florida_pb)
fit3
#>
#> Call:
#> lm(formula = Buchanan00 ~ Perot96, data = florida_pb)
#>
#> Coefficients:
#> (Intercept) Perot96
#> 45.8419 0.0244
The \(R^2\), or coefficient of determination, is much higher without Palm Beach:
glance(fit3)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> 1 0.851 0.849 87.7 366 3.61e-28 2 -388 782 788
#> deviance df.residual
#> 1 492803 64
florida_pb %>%
add_residuals(fit3) %>%
add_predictions(fit3) %>%
ggplot(aes(x = pred, y = resid)) +
geom_ref_line(h = 0) +
geom_point() +
ylim(-750, 2500) +
xlim(0, 1500) +
labs(x = "Fitted values", y = "residuals")
Create predictions for both models using data_grid and gather_predictions:
florida_grid <-
florida %>%
data_grid(Perot96) %>%
gather_predictions(fit2, fit3) %>%
mutate(model =
fct_recode(model,
"Regression\n with Palm Beach" = "fit2",
"Regression\n without Palm Beach" = "fit3"))
Note that this is an example of using non-syntactic names, as discussed in Chapter 10 of R for Data Science.
ggplot() +
geom_point(data = florida, mapping = aes(x = Perot96, y = Buchanan00)) +
geom_line(data = florida_grid,
mapping = aes(x = Perot96, y = pred,
colour = model)) +
geom_label(data = filter(florida, county == "PalmBeach"),
mapping = aes(x = Perot96, y = Buchanan00, label = county),
vjust = "top", hjust = "right") +
geom_text(data = tibble(label = unique(florida_grid$model),
x = c(20000, 31000),
y = c(1000, 300)),
mapping = aes(x = x, y = y, label = label, colour = label)) +
labs(x = "Perot's Vote in 1996", y = "Buchanan's Votes in 1996") +
theme(legend.position = "none")
See Graphics for communication in R for Data Science on labels and annotations in plots.
4.3 Regression and Causation
4.3.1 Randomized Experiments
Load data
data("women", package = "qss")
Proportion of female politicians in reserved GP vs. unreserved GP
women %>%
group_by(reserved) %>%
summarise(prop_female = mean(female))
#> # A tibble: 2 x 2
#> reserved prop_female
#> <int> <dbl>
#> 1 0 0.0748
#> 2 1 1.00
The difference-in-means estimator:
# difference in means for new irrigation facilities
mean(women$irrigation[women$reserved == 1]) -
mean(women$irrigation[women$reserved == 0])
#> [1] -0.369
Mean values of irrigation
and water
in reserved and non-reserved districts.
women %>%
group_by(reserved) %>%
summarise(irrigation = mean(irrigation),
water = mean(water))
#> # A tibble: 2 x 3
#> reserved irrigation water
#> <int> <dbl> <dbl>
#> 1 0 3.39 14.7
#> 2 1 3.02 24.0
The difference between the two groups can be calculated with the function diff, which calculates the difference between subsequent observations. This works as long as we are careful about which group is first or second.
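For example:
diff(c(2, 5, 9))
#> [1] 3 4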
women %>%
group_by(reserved) %>%
summarise(irrigation = mean(irrigation),
water = mean(water)) %>%
summarise(diff_irrigation = diff(irrigation),
diff_water = diff(water))
#> # A tibble: 1 x 2
#> diff_irrigation diff_water
#> <dbl> <dbl>
#> 1 -0.369 9.25
The other way uses tidyr spread and gather,
women %>%
group_by(reserved) %>%
summarise(irrigation = mean(irrigation),
water = mean(water)) %>%
gather(variable, value, -reserved) %>%
spread(reserved, value) %>%
mutate(diff = `1` - `0`)
#> # A tibble: 2 x 4
#> variable `0` `1` diff
#> <chr> <dbl> <dbl> <dbl>
#> 1 irrigation 3.39 3.02 -0.369
#> 2 water 14.7 24.0 9.25
Now each row is an outcome variable of interest, and there are columns for the treatment (1
) and control (0
) groups, and the difference (diff
).
lm(water ~ reserved, data = women)
#>
#> Call:
#> lm(formula = water ~ reserved, data = women)
#>
#> Coefficients:
#> (Intercept) reserved
#> 14.74 9.25
lm(irrigation ~ reserved, data = women)
#>
#> Call:
#> lm(formula = irrigation ~ reserved, data = women)
#>
#> Coefficients:
#> (Intercept) reserved
#> 3.388 -0.369
4.3.2 Regression with multiple predictors
data("social", package = "qss")
glimpse(social)
#> Observations: 305,866
#> Variables: 6
#> $ sex <chr> "male", "female", "male", "female", "female", "mal...
#> $ yearofbirth <int> 1941, 1947, 1951, 1950, 1982, 1981, 1959, 1956, 19...
#> $ primary2004 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,...
#> $ messages <chr> "Civic Duty", "Civic Duty", "Hawthorne", "Hawthorn...
#> $ primary2006 <int> 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,...
#> $ hhsize <int> 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1, 2, 2, 1, 2, 2, 1,...
The messages variable is a character vector rather than a factor, so it has no levels:
levels(social$messages)
#> NULL
fit <- lm(primary2006 ~ messages, data = social)
fit
#>
#> Call:
#> lm(formula = primary2006 ~ messages, data = social)
#>
#> Coefficients:
#> (Intercept) messagesControl messagesHawthorne
#> 0.31454 -0.01790 0.00784
#> messagesNeighbors
#> 0.06341
Create indicator variables for each message:
social <-
social %>%
mutate(Control = as.integer(messages == "Control"),
Hawthorne = as.integer(messages == "Hawthorne"),
Neighbors = as.integer(messages == "Neighbors"))
Alternatively, create these using a for loop, which is easier to understand and less prone to typos:
for (i in unique(social$messages)) {
social[[i]] <- as.integer(social[["messages"]] == i)
}
We created a variable for each level of messages
even though we will exclude one of them.
lm(primary2006 ~ Control + Hawthorne + Neighbors, data = social)
#>
#> Call:
#> lm(formula = primary2006 ~ Control + Hawthorne + Neighbors, data = social)
#>
#> Coefficients:
#> (Intercept) Control Hawthorne Neighbors
#> 0.31454 -0.01790 0.00784 0.06341
Create predictions for each unique value of messages
unique_messages <-
data_grid(social, messages) %>%
add_predictions(fit)
unique_messages
#> # A tibble: 4 x 2
#> messages pred
#> <chr> <dbl>
#> 1 Civic Duty 0.315
#> 2 Control 0.297
#> 3 Hawthorne 0.322
#> 4 Neighbors 0.378
Compare to the sample averages
social %>%
group_by(messages) %>%
summarise(mean(primary2006))
#> # A tibble: 4 x 2
#> messages `mean(primary2006)`
#> <chr> <dbl>
#> 1 Civic Duty 0.315
#> 2 Control 0.297
#> 3 Hawthorne 0.322
#> 4 Neighbors 0.378
Linear regression without intercept.
fit_noint <- lm(primary2006 ~ -1 + messages, data = social)
fit_noint
#>
#> Call:
#> lm(formula = primary2006 ~ -1 + messages, data = social)
#>
#> Coefficients:
#> messagesCivic Duty messagesControl messagesHawthorne
#> 0.315 0.297 0.322
#> messagesNeighbors
#> 0.378
Calculating the average effect of each message is also easier if we make the control group the first level, so that all regression coefficients are comparisons to it. Use fct_relevel to make "Control" the first level:
fit_control <-
mutate(social, messages = fct_relevel(messages, "Control")) %>%
lm(primary2006 ~ messages, data = .)
fit_control
#>
#> Call:
#> lm(formula = primary2006 ~ messages, data = .)
#>
#> Coefficients:
#> (Intercept) messagesCivic Duty messagesHawthorne
#> 0.2966 0.0179 0.0257
#> messagesNeighbors
#> 0.0813
Difference in means
social %>%
group_by(messages) %>%
summarise(primary2006 = mean(primary2006)) %>%
mutate(Control = primary2006[messages == "Control"],
diff = primary2006 - Control)
#> # A tibble: 4 x 4
#> messages primary2006 Control diff
#> <chr> <dbl> <dbl> <dbl>
#> 1 Civic Duty 0.315 0.297 0.0179
#> 2 Control 0.297 0.297 0
#> 3 Hawthorne 0.322 0.297 0.0257
#> 4 Neighbors 0.378 0.297 0.0813
Adjusted R-squared is included in the output of broom::glance()
glance(fit)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC
#> 1 0.00328 0.00327 0.463 336 1.06e-217 4 -198247 396504
#> BIC deviance df.residual
#> 1 396557 65468 305862
glance(fit)[["adj.r.squared"]]
#> [1] 0.00327
4.3.3 Heterogeneous Treatment Effects
Average treatment effect (ATE) of the Neighbors message among those who did and did not vote in the 2004 primary:
ate <-
social %>%
group_by(primary2004, messages) %>%
summarise(primary2006 = mean(primary2006)) %>%
spread(messages, primary2006) %>%
mutate(ate_Neighbors = Neighbors - Control) %>%
select(primary2004, Neighbors, Control, ate_Neighbors)
ate
#> # A tibble: 2 x 4
#> # Groups: primary2004 [2]
#> primary2004 Neighbors Control ate_Neighbors
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.306 0.237 0.0693
#> 2 1 0.482 0.386 0.0965
Difference in the ATE between 2004 voters and non-voters:
diff(ate$ate_Neighbors)
#> [1] 0.0272
social_neighbor <- social %>%
filter( (messages == "Control") | (messages == "Neighbors"))
fit_int <- lm(primary2006 ~ primary2004 + messages + primary2004:messages,
data = social_neighbor)
fit_int
#>
#> Call:
#> lm(formula = primary2006 ~ primary2004 + messages + primary2004:messages,
#> data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) primary2004
#> 0.2371 0.1487
#> messagesNeighbors primary2004:messagesNeighbors
#> 0.0693 0.0272
lm(primary2006 ~ primary2004 * messages, data = social_neighbor)
#>
#> Call:
#> lm(formula = primary2006 ~ primary2004 * messages, data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) primary2004
#> 0.2371 0.1487
#> messagesNeighbors primary2004:messagesNeighbors
#> 0.0693 0.0272
social_neighbor <-
social_neighbor %>%
mutate(age = 2008 - yearofbirth)
summary(social_neighbor$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 22.0 43.0 52.0 51.8 61.0 108.0
fit.age <- lm(primary2006 ~ age * messages, data = social_neighbor)
fit.age
#>
#> Call:
#> lm(formula = primary2006 ~ age * messages, data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) age messagesNeighbors
#> 0.089477 0.003998 0.048573
#> age:messagesNeighbors
#> 0.000628
Calculate average treatment effects
ate.age <-
crossing(age = seq(from = 25, to = 85, by = 20),
messages = c("Neighbors", "Control")) %>%
add_predictions(fit.age) %>%
spread(messages, pred) %>%
mutate(diff = Neighbors - Control)
ate.age
#> # A tibble: 4 x 4
#> age Control Neighbors diff
#> <dbl> <dbl> <dbl> <dbl>
#> 1 25 0.189 0.254 0.0643
#> 2 45 0.269 0.346 0.0768
#> 3 65 0.349 0.439 0.0894
#> 4 85 0.429 0.531 0.102
You can use the poly function to calculate polynomials instead of adding each term, as in age + I(age ^ 2). Note that the coefficients will be different, since by default poly calculates orthogonal polynomials rather than the natural (raw) polynomials. However, you shouldn't interpret the coefficients directly anyway, so this shouldn't matter (a quick check that the fitted values are unchanged follows the model output below).
fit.age2 <- lm(primary2006 ~ poly(age, 2) * messages,
data = social_neighbor)
fit.age2
#>
#> Call:
#> lm(formula = primary2006 ~ poly(age, 2) * messages, data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) poly(age, 2)1
#> 0.2966 27.6665
#> poly(age, 2)2 messagesNeighbors
#> -10.2832 0.0816
#> poly(age, 2)1:messagesNeighbors poly(age, 2)2:messagesNeighbors
#> 4.5820 -5.5124
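As a quick check (not in the text) that raw polynomials change only the coefficients and not the fit, compare the fitted values from a model using poly(age, 2, raw = TRUE):
fit.age2_raw <- lm(primary2006 ~ poly(age, 2, raw = TRUE) * messages,
data = social_neighbor)
all.equal(fitted(fit.age2), fitted(fit.age2_raw))
#> [1] TRUE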
Create a data frame of combinations of ages and messages using data_grid, which means that we only need to specify the variables, and not the specific values,
y.hat <-
data_grid(social_neighbor, age, messages) %>%
add_predictions(fit.age2)
ggplot(y.hat, aes(x = age, y = pred,
colour = str_c(messages, " condition"))) +
geom_line() +
labs(colour = "", y = "Predicted turnout rates") +
theme(legend.position = "bottom")
y.hat %>%
spread(messages, pred) %>%
mutate(ate = Neighbors - Control) %>%
filter(age > 20, age < 90) %>%
ggplot(aes(x = age, y = ate)) +
geom_line() +
labs(y = "Estimated average treatment effect",
x = "Age") +
ylim(0, 0.1)
4.3.4 Regression Discontinuity Design
data("MPs", package = "qss")
MPs_labour <- filter(MPs, party == "labour")
MPs_tory <- filter(MPs, party == "tory")
labour_fit1 <- lm(ln.net ~ margin,
data = filter(MPs_labour, margin < 0))
labour_fit2 <- lm(ln.net ~ margin, data = filter(MPs_labour, margin > 0))
tory_fit1 <- lm(ln.net ~ margin,
data = filter(MPs_tory, margin < 0))
tory_fit2 <- lm(ln.net ~ margin, data = filter(MPs_tory, margin > 0))
Use data_grid and add_predictions to generate a grid of predictions for each regression:
y1_labour <-
filter(MPs_labour, margin < 0) %>%
data_grid(margin) %>%
add_predictions(labour_fit1)
y2_labour <-
filter(MPs_labour, margin > 0) %>%
data_grid(margin) %>%
add_predictions(labour_fit2)
y1_tory <-
filter(MPs_tory, margin < 0) %>%
data_grid(margin) %>%
add_predictions(tory_fit1)
y2_tory <-
filter(MPs_tory, margin > 0) %>%
data_grid(margin) %>%
add_predictions(tory_fit2)
Tory politicians
ggplot() +
geom_ref_line(v = 0) +
geom_point(data = MPs_tory,
mapping = aes(x = margin, y = ln.net)) +
geom_line(data = y1_tory,
mapping = aes(x = margin, y = pred), colour = "red", size = 1.5) +
geom_line(data = y2_tory,
mapping = aes(x = margin, y = pred), colour = "red", size = 1.5) +
labs(x = "margin of victory", y = "log net wealth at death",
title = "labour")
We can actually produce this plot easily without running the regressions, by using geom_smooth
:
ggplot(mutate(MPs, winner = (margin > 0)),
aes(x = margin, y = ln.net)) +
geom_ref_line(v = 0) +
geom_point() +
geom_smooth(method = lm, se = FALSE, mapping = aes(group = winner)) +
facet_grid(party ~ .) +
labs(x = "margin of victory", y = "log net wealth at death")
In the previous code, I didn't directly compute the average net wealth at a margin of 0, so I'll do that here. I'll use spread_predictions to add predictions from multiple models:
spread_predictions(data_frame(margin = 0),
tory_fit1, tory_fit2) %>%
mutate(rd_est = tory_fit2 - tory_fit1)
#> # A tibble: 1 x 4
#> margin tory_fit1 tory_fit2 rd_est
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 12.5 13.2 0.650
tory_fit3 <- lm(margin.pre ~ margin, data = filter(MPs_tory, margin < 0))
tory_fit4 <- lm(margin.pre ~ margin, data = filter(MPs_tory, margin > 0))
(filter(tidy(tory_fit4), term == "(Intercept)")[["estimate"]] -
filter(tidy(tory_fit3), term == "(Intercept)")[["estimate"]])
#> [1] -0.0173