4 Prediction
Prerequisites
library("tidyverse")
library("lubridate")
library("stringr")
library("forcats")
The packages modelr and broom are used to wrangle the results of linear regressions:
library("broom")
library("modelr")
4.1 Predicting Election Outcomes
4.1.1 Loops in R
RStudio provides many features to help with debugging, which will be useful in for loops and functions: see this article for an example.
values <- c(2, 3, 6)
n <- length(values)
results <- rep(NA, n)
for (i in 1:n) {
results[[i]] <- values[[i]] * 2
cat(values[[i]], "times 2 is equal to", results[[i]], "\n")
}
#> 2 times 2 is equal to 4
#> 3 times 2 is equal to 6
#> 6 times 2 is equal to 12
Note that the above code uses the for loop for pedagogical purposes only; it could simply have been written as
results <- values * 2
results
#> [1] 4 6 12
In general, avoid using for loops when there is a vectorized function.
But sticking with the for loop, there are several things that could be improved.
Avoid using the idiom 1:n in for loops. To see why, look at what happens when values is empty:
values <- c()
n <- length(values)
results <- rep(NA, n)
for (i in 1:n) {
cat("i = ", i, "\n")
results[i] <- values[i] * 2
cat(values[i], "times 2 is equal to", results[i], "\n")
}
#> i = 1
#> Error in results[i] <- values[i] * 2: replacement has length zero
Instead of not running the loop, as you might expect, it attempts two iterations, with i = 1 and then i = 0, because 1:0 counts down. This edge case occurs more often than you may think, especially if you are writing functions where you don't know the length of the vector ex ante.
The way to avoid this is to use either seq_len(n) or seq_along(values), which handle 0-length vectors correctly.
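A quick comparison shows why: 1:0 counts down, while the seq functions return a length-zero vector for empty input, so the loop body never runs.
1:0
#> [1] 1 0
seq_len(0)
#> integer(0)
seq_along(c())
#> integer(0)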
values <- c()
n <- length(values)
results <- rep(NA, n)
for (i in seq_along(values)) {
results[i] <- values[i] * 2
}
print(results)
#> logical(0)
or
values <- c()
n <- length(values)
results <- rep(NA, n)
for (i in seq_len(n)) {
results[i] <- values[i] * 2
}
print(results)
#> logical(0)
Also, note that the result is logical(0)
. That’s because the NA
missing value has class logical, and thus rep(NA, ...)
returns a logical vector. It is better style to initialize the vector with the same data type that you will be using,
results <- rep(NA_real_, length(values))
results
#> numeric(0)
class(results)
#> [1] "numeric"
Often loops can be rewritten to use a map function. Read the R for Data Science chapter Iteration before proceeding.
To do so, we first write a function that will be applied to each element of the vector. When converting from a for
loop to a function, this is usually simply the body of the for
loop, though you may need to add arguments for any variables defined outside the body of the for loop. In this case,
mult_by_two <- function(x) {
x * 2
}
We can now test that this function works on different values:
mult_by_two(0)
#> [1] 0
mult_by_two(2.5)
#> [1] 5
mult_by_two(-3)
#> [1] -6
At this point, we could replace the body of the for
loop with this function:
values <- c(2, 4, 6)
n <- length(values)
results <- rep(NA, n)
for (i in seq_len(n)) {
results[i] <- mult_by_two(values[i])
}
print(results)
#> [1] 4 8 12
This can be useful if the body of a for loop is many lines long.
However, this loop is still unwieldy code. We have to remember to define an empty vector results
that is the same size as values
to hold the results, and then correctly loop over all the values. We already saw how these steps leave room for errors. Functionals like map apply a function to each element of a vector.
results <- map(values, mult_by_two)
results
#> [[1]]
#> [1] 4
#>
#> [[2]]
#> [1] 8
#>
#> [[3]]
#> [1] 12
The values of each element are correct, but map returns a list, not the numeric vector we may have been expecting. If we want a numeric vector, use map_dbl,
results <- map_dbl(values, mult_by_two)
results
#> [1] 4 8 12
Also, instead of explicitly defining a function like mult_by_two, we could have used an anonymous function with the functional. An anonymous function is a function that is not assigned to a name.
results <- map_dbl(values, function(x) x * 2)
results
#> [1] 4 8 12
The various purrr functions will also interpret one-sided formulas as functions, in which .x and .y refer to (up to) two arguments.
results <- map_dbl(values, ~ .x * 2)
results
#> [1] 4 8 12
This is for parsimony and convenience; in the background, these functions are creating anonymous functions from the given formula.
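When a function of two inputs is needed, map2_dbl makes both arguments available as .x and .y. A small sketch (the second vector here is made up for illustration):
map2_dbl(values, c(10, 20, 30), ~ .x * 2 + .y)
#> [1] 14 28 42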
QSS discusses several debugging strategies. The functional approach lends itself to easier debugging because the function can be tested with input values independently of the loop.
4.1.2 General Conditional Statements in R
See the R for Data Science section Conditional Execution for a more complete discussion of conditional execution.
If you are using conditional statements to assign values within a data frame, see the dplyr functions if_else, recode, and case_when.
The following code, which uses a for loop,
values <- 1:5
n <- length(values)
results <- rep(NA_real_, n)
for (i in seq_len(n)) {
x <- values[i]
r <- x %% 2
if (r == 0) {
cat(x, "is even and I will perform addition", x, " + ", x, "\n")
results[i] <- x + x
} else {
cat(x, "is even and I will perform multiplication", x, " * ", x, "\n")
results[i] <- x * x
}
}
#> 1 is odd and I will perform multiplication 1 * 1
#> 2 is even and I will perform addition 2 + 2
#> 3 is odd and I will perform multiplication 3 * 3
#> 4 is even and I will perform addition 4 + 4
#> 5 is odd and I will perform multiplication 5 * 5
results
#> [1] 1 4 9 8 25
could be rewritten to use if_else
,
if_else(values %% 2 == 0, values + values, values * values)
#> [1] 1 4 9 8 25
or using the map_dbl
functional with a named function,
myfunc <- function(x) {
if (x %% 2 == 0) {
x + x
} else {
x * x
}
}
map_dbl(values, myfunc)
#> [1] 1 4 9 8 25
or map_dbl
with an anonymous function,
map_dbl(values, function(x) {
if (x %% 2 == 0) {
x + x
} else {
x * x
}
})
#> [1] 1 4 9 8 25
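The recode function mentioned at the start of this subsection handles the simpler case of mapping specific old values to new ones. A quick sketch with made-up data:
party <- c("D", "R", "D", "I")
recode(party, D = "Democrat", R = "Republican", .default = "Other")
#> [1] "Democrat"   "Republican" "Democrat"   "Other"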
4.1.3 Poll Predictions
Load the election polls by state for the 2008 US Presidential election,
data("polls08", package = "qss")
glimpse(polls08)
#> Observations: 1,332
#> Variables: 5
#> $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",...
#> $ Pollster <chr> "SurveyUSA-2", "Capital Survey-2", "SurveyUSA-2", "Ca...
#> $ Obama <int> 36, 34, 35, 35, 39, 34, 36, 25, 35, 34, 37, 36, 36, 3...
#> $ McCain <int> 61, 54, 62, 55, 60, 64, 58, 52, 55, 47, 55, 51, 49, 5...
#> $ middate <date> 2008-10-27, 2008-10-15, 2008-10-08, 2008-10-06, 2008...
and the election results,
data("pres08", package = "qss")
glimpse(pres08)
#> Observations: 51
#> Variables: 5
#> $ state.name <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo...
#> $ state <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE...
#> $ Obama <int> 39, 38, 45, 39, 61, 54, 61, 92, 62, 51, 47, 72, 36,...
#> $ McCain <int> 60, 59, 54, 59, 37, 45, 38, 7, 37, 48, 52, 27, 62, ...
#> $ EV <int> 9, 3, 10, 6, 55, 9, 7, 3, 3, 27, 15, 4, 4, 21, 11, ...
Compute Obama's margin in the polls and in the final election results:
polls08 <-
polls08 %>% mutate(margin = Obama - McCain)
pres08 <-
pres08 %>% mutate(margin = Obama - McCain)
To work with dates, the R package lubridate makes wrangling them much easier. See the R for Data Science chapter Dates and Times.
The function ymd converts character strings written in year-month-day order into dates. See dmy, mdy, and others for other orderings of the date components.
x <- ymd("2008-11-04")
y <- ymd("2008/9/1")
x - y
#> Time difference of 64 days
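For example, mdy parses dates written in month-day-year order:
mdy("11/4/2008")
#> [1] "2008-11-04"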
However, note that in polls08
, the date middate
is already a date
object,
class(polls08$middate)
#> [1] "Date"
The function read_csv by default will check character vectors to see if they have patterns that appear to be dates, and if so, will parse those columns as dates.
We’ll create a variable for election day
ELECTION_DAY <- ymd("2008-11-04")
and add a new column to polls08 with the days to the election:
polls08 <- mutate(polls08, ELECTION_DAY - middate)
Although the code in the chapter uses a for
loop, there is no reason to do so. We can accomplish the same task by merging the election results data to the polling data by state
.
polls_w_results <- left_join(polls08,
select(pres08, state, elec_margin = margin),
by = "state") %>%
mutate(error = elec_margin - margin)
glimpse(polls_w_results)
#> Observations: 1,332
#> Variables: 9
#> $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "...
#> $ Pollster <chr> "SurveyUSA-2", "Capital Survey-2", "S...
#> $ Obama <int> 36, 34, 35, 35, 39, 34, 36, 25, 35, 3...
#> $ McCain <int> 61, 54, 62, 55, 60, 64, 58, 52, 55, 4...
#> $ middate <date> 2008-10-27, 2008-10-15, 2008-10-08, ...
#> $ margin <int> -25, -20, -27, -20, -21, -30, -22, -2...
#> $ `ELECTION_DAY - middate` <time> 8 days, 20 days, 27 days, 29 days, 4...
#> $ elec_margin <int> -21, -21, -21, -21, -21, -21, -21, -2...
#> $ error <int> 4, -1, 6, -1, 0, 9, 1, 6, -1, -8, -3,...
To get the last poll in each state, arrange by state and descending middate, then take the first row in each group:
last_polls <-
polls_w_results %>%
arrange(state, desc(middate)) %>%
group_by(state) %>%
slice(1)
last_polls
#> # A tibble: 51 x 9
#> # Groups: state [51]
#> state Pollster Obama McCain middate margin `ELECTION_DAY - midd…
#> <chr> <chr> <int> <int> <date> <int> <time>
#> 1 AK Research 200… 39 58 2008-10-29 -19 6
#> 2 AL SurveyUSA-2 36 61 2008-10-27 -25 8
#> 3 AR ARG-4 44 51 2008-10-29 - 7 6
#> 4 AZ ARG-3 46 50 2008-10-29 - 4 6
#> 5 CA SurveyUSA-3 60 36 2008-10-30 24 5
#> 6 CO ARG-3 52 45 2008-10-29 7 6
#> # ... with 45 more rows, and 2 more variables: elec_margin <int>,
#> # error <int>
Challenge: Instead of using the last poll, what happens if you use the average of polls from the last week? The last month? How do the poll margins change over the course of the election period?
To simplify things for later, let’s define a function rmse
which calculates the root mean squared error, as defined in the book. See the R for Data Science chapter Functions for more on writing functions.
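In notation, the function below computes \[ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (\text{actual}_i - \text{pred}_i)^2} . \]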
rmse <- function(actual, pred) {
sqrt(mean( (actual - pred) ^ 2))
}
Now we can use rmse()
to calculate the RMSE for all the final polls:
rmse(last_polls$margin, last_polls$elec_margin)
#> [1] 5.88
Or since we already have a variable error
,
sqrt(mean(last_polls$error ^ 2))
#> [1] 5.88
The mean prediction error is
mean(last_polls$error)
#> [1] 1.08
This is slightly different from the value in the book because of which poll is treated as each state's final poll; many states have multiple polls on the last day.
I’ll choose bin widths of 1%, since that is fairly interpretable:
ggplot(last_polls, aes(x = error)) +
geom_histogram(binwidth = 1, boundary = 0)
The text uses bin widths of 5%:
ggplot(last_polls, aes(x = error)) +
geom_histogram(binwidth = 5, boundary = 0)
Challenge: What other ways could you visualize the results? How would you show all states? What about plotting the absolute or squared errors instead of the errors?
Challenge: What happens to prediction error if you average polls? Consider averaging back over time? What happens if you take the averages of the state poll average and average of all polls - does that improve prediction?
To create a scatter plot that uses the state abbreviations instead of points, use geom_text instead of geom_point.
ggplot(last_polls, aes(x = margin, y = elec_margin, label = state)) +
geom_abline(color = "white", size = 2) +
geom_hline(yintercept = 0, color = "gray", size = 2) +
geom_vline(xintercept = 0, color = "gray", size = 2) +
geom_text() +
coord_fixed() +
labs(x = "Poll Results", y = "Actual Election Results")
We can create a confusion matrix as follows. Create a new column classification which shows how the poll's prediction was related to the actual election outcome ("true positive", "false positive", "true negative", "false negative"). If there were only two outcomes, we could use the if_else function. But with more than two outcomes, it is easier to use the dplyr function case_when.
last_polls <-
last_polls %>%
ungroup() %>%
mutate(classification =
case_when(
(.$margin > 0 & .$elec_margin > 0) ~ "true positive",
(.$margin > 0 & .$elec_margin < 0) ~ "false positive",
(.$margin < 0 & .$elec_margin < 0) ~ "true negative",
(.$margin < 0 & .$elec_margin > 0) ~ "false negative"
))
You need to use . to refer to the data frame when using case_when() within mutate(). Also, we needed to first use ungroup() in order to remove the grouping variable so that mutate() will work.
Now simply count the number of polls in each category of classification
:
last_polls %>%
group_by(classification) %>%
count()
#> # A tibble: 4 x 2
#> # Groups: classification [4]
#> classification n
#> <chr> <int>
#> 1 false negative 2
#> 2 false positive 1
#> 3 true negative 21
#> 4 true positive 27
Which states were incorrectly predicted by the polls, and what were their margins?
last_polls %>%
filter(classification %in% c("false positive", "false negative")) %>%
select(state, margin, elec_margin, classification) %>%
arrange(desc(elec_margin))
#> # A tibble: 3 x 4
#> state margin elec_margin classification
#> <chr> <int> <int> <chr>
#> 1 IN -5 1 false negative
#> 2 NC -1 1 false negative
#> 3 MO 1 -1 false positive
What was the difference between the poll prediction of electoral votes and the actual electoral votes? We didn't include the variable EV when we first merged, but that's no problem; we'll just merge again to grab that variable:
last_polls %>%
left_join(select(pres08, state, EV), by = "state") %>%
summarise(EV_pred = sum( (margin > 0) * EV),
EV_actual = sum( (elec_margin > 0) * EV))
#> # A tibble: 1 x 2
#> EV_pred EV_actual
#> <int> <int>
#> 1 349 364
data("pollsUS08", package = "qss")
pollsUS08 <- mutate(pollsUS08, DaysToElection = ELECTION_DAY - middate)
We'll produce the seven-day averages slightly differently from the method used in the text: for all dates in the data, we'll calculate the moving average. The code presented in QSS uses a for loop similar to the following:
all_dates <- seq(min(polls08$middate), ELECTION_DAY, by = "days")
# Number of poll days to use
POLL_DAYS <- 7
pop_vote_avg <- vector(length(all_dates), mode = "list")
for (i in seq_along(all_dates)) {
date <- all_dates[i]
# summarise the seven day
week_data <-
filter(polls08,
as.integer(middate - date) <= 0,
as.integer(middate - date) > - POLL_DAYS) %>%
summarise(Obama = mean(Obama, na.rm = TRUE),
McCain = mean(McCain, na.rm = TRUE))
# add date for the observation
week_data$date <- date
pop_vote_avg[[i]] <- week_data
}
pop_vote_avg <- bind_rows(pop_vote_avg)
Write a function that takes a date and calculates the moving average over the preceding days (an argument with a default of 7) using the dataset .data:
poll_ma <- function(date, .data, days = 7) {
filter(.data,
as.integer(middate - date) <= 0,
as.integer(middate - date) > - !!days) %>%
summarise(Obama = mean(Obama, na.rm = TRUE),
McCain = mean(McCain, na.rm = TRUE)) %>%
mutate(date = !!date)
}
The code above uses !!
. This tells filter
that days
refers to a variable days
in the calling environment, and not a column named days
in the data frame. In this case, there wouldn’t be any ambiguities since there is not a column named days
, but in general there can be ambiguities in the dplyr functions as to whether the names refer to columns in the data frame or variables in the environment calling the function. Read Programming with dplyr for an in-depth discussion of this.
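Here is a minimal, hypothetical illustration of the ambiguity that !! resolves; both the data frame df and the variable days below are made up for this sketch:
df <- tibble(days = 1:10)
days <- 3
# Without !!, `days` on the right-hand side is the column itself, so every
# row is compared with its own value and no rows are returned:
filter(df, days < days)
# With !!, the right-hand `days` is forced to be the variable from the
# calling environment (3), so the rows with days of 1 or 2 are returned:
filter(df, days < !!days)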
Calling it for November 1, 2008 returns a one-row data frame with the moving averages for Obama and McCain:
poll_ma(as.Date("2008-11-01"), polls08)
#> Obama McCain date
#> 1 49.1 45.4 2008-11-01
Since we made days an argument to the function, we can easily calculate other moving averages,
poll_ma(as.Date("2008-11-01"), polls08, days = 3)
#> Obama McCain date
#> 1 50.6 45.4 2008-11-01
Now use a functional to execute that function with all dates for which we want moving averages. The function poll_ma
returns a data frame, and our ideal output is a data frame that stacks those data frames row-wise. So we will use the map_df
function,
map_df(all_dates, poll_ma, polls08)
Note that the other arguments for poll_ma
are placed after the name of the function as additional arguments to map_df
.
It is easier to plot this if the data are tidy, with Obama
and McCain
as categories of a column candidate
.
pop_vote_avg_tidy <-
pop_vote_avg %>%
gather(candidate, share, -date, na.rm = TRUE)
head(pop_vote_avg_tidy)
#> date candidate share
#> 1 2008-01-01 Obama 46.5
#> 2 2008-01-02 Obama 46.5
#> 3 2008-01-03 Obama 46.5
#> 4 2008-01-04 Obama 46.5
#> 5 2008-01-05 Obama 46.5
#> 6 2008-01-06 Obama 46.5
ggplot(pop_vote_avg_tidy, aes(x = date, y = share,
colour = fct_reorder2(candidate, date, share))) +
geom_point() +
geom_line() +
scale_colour_manual("Candidate",
values = c(Obama = "blue", McCain = "red"))
Challenge: Read the R for Data Science chapter Iteration and use the function map_df to create the object pop_vote_avg as above instead of using a for loop.
The 7-day average is similar to the simple method used by Real Clear Politics. The RCP average is simply the average of all polls in their data over the last seven days. Sites like 538 and the Huffpost Pollster, on the other hand, also average polls, but they use more sophisticated statistical methods to assign different weights to different polls.
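The poll_ma function defined above can compute this sort of average directly. A quick sketch, assuming the national polls in pollsUS08 have the same Obama, McCain, and middate columns as the state polls:
poll_ma(ELECTION_DAY, pollsUS08)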
Challenge: Why do we need to use different polls for the popular vote data? Why not simply average all the state polls? What would you have to do? Would the overall popular vote be useful in predicting state-level polling, or vice-versa? How would you use them?
4.2 Linear Regression
4.2.1 Facial Appearance and Election Outcomes
Load the face
dataset:
data("face", package = "qss")
Add Democrat and Republican vote shares, and the difference in shares:
face <- mutate(face,
d.share = d.votes / (d.votes + r.votes),
r.share = r.votes / (d.votes + r.votes),
diff.share = d.share - r.share)
Plot facial competence vs. vote share:
ggplot(face, aes(x = d.comp, y = diff.share, colour = w.party)) +
geom_ref_line(h = 0) +
geom_point() +
scale_colour_manual("Winning\nParty",
values = c(D = "blue", R = "red")) +
labs(x = "Competence scores for Democrats",
y = "Democratic margin in vote share")
4.2.2 Correlation and Scatter Plots
cor(face$d.comp, face$diff.share)
#> [1] 0.433
4.2.3 Least Squares
Run the linear regression
fit <- lm(diff.share ~ d.comp, data = face)
fit
#>
#> Call:
#> lm(formula = diff.share ~ d.comp, data = face)
#>
#> Coefficients:
#> (Intercept) d.comp
#> -0.312 0.660
There are many functions to get data out of the lm
model.
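For example, base R provides extractor functions for lm objects (a few are sketched below; this output is not shown in the text):
coef(fit)                  # named vector of estimated coefficients
head(fitted(fit))          # fitted values
head(residuals(fit))       # residuals
summary(fit)$coefficients  # coefficient table as a matrix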
In addition to these, the broom package provides three functions: glance
, tidy
, and augment
that always return data frames.
The function glance returns a one-row data-frame summary of the model,
glance(fit)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> 1 0.187 0.18 0.266 27 8.85e-07 2 -10.5 27 35.3
#> deviance df.residual
#> 1 8.31 117
The function tidy returns a data frame in which each row is a coefficient,
tidy(fit)
#> term estimate std.error statistic p.value
#> 1 (Intercept) -0.312 0.066 -4.73 6.24e-06
#> 2 d.comp 0.660 0.127 5.19 8.85e-07
The function augment returns the original data with fitted values, residuals, and other observation level stats from the model appended to it.
augment(fit) %>% head()
#> diff.share d.comp .fitted .se.fit .resid .hat .sigma .cooksd
#> 1 0.2101 0.565 0.0606 0.0266 0.1495 0.00996 0.267 0.001600
#> 2 0.1194 0.342 -0.0864 0.0302 0.2059 0.01286 0.267 0.003938
#> 3 0.0499 0.612 0.0922 0.0295 -0.0423 0.01229 0.268 0.000158
#> 4 0.1965 0.542 0.0454 0.0256 0.1511 0.00922 0.267 0.001510
#> 5 0.4958 0.680 0.1370 0.0351 0.3588 0.01737 0.266 0.016307
#> 6 -0.3495 0.321 -0.1006 0.0319 -0.2490 0.01433 0.267 0.006436
#> .std.resid
#> 1 0.564
#> 2 0.778
#> 3 -0.160
#> 4 0.570
#> 5 1.358
#> 6 -0.941
We can plot the results of the bivariate linear regression as follows:
ggplot() +
geom_point(data = face, mapping = aes(x = d.comp, y = diff.share)) +
geom_ref_line(v = mean(face$d.comp)) +
geom_ref_line(h = mean(face$diff.share)) +
geom_abline(slope = coef(fit)["d.comp"],
intercept = coef(fit)["(Intercept)"],
colour = "red") +
annotate("text", x = 0.9, y = mean(face$diff.share) + 0.05,
label = "Mean of Y", color = "blue", vjust = 0) +
annotate("text", y = -0.9, x = mean(face$d.comp), label = "Mean of X",
color = "blue", hjust = 0) +
scale_y_continuous("Democratic margin in vote shares",
breaks = seq(-1, 1, by = 0.5), limits = c(-1, 1)) +
scale_x_continuous("Democratic margin in vote shares",
breaks = seq(0, 1, by = 0.2), limits = c(0, 1)) +
ggtitle("Facial compotence and vote share")
A more general way to plot the predictions of the model against the data is to use the methods described in Ch 23.3.3 of R4DS. Create an evenly spaced grid of values of d.comp
, and add predictions of the model to it.
grid <- face %>%
data_grid(d.comp) %>%
add_predictions(fit)
head(grid)
#> # A tibble: 6 x 2
#> d.comp pred
#> <dbl> <dbl>
#> 1 0.0640 -0.270
#> 2 0.0847 -0.256
#> 3 0.0893 -0.253
#> 4 0.115 -0.237
#> 5 0.115 -0.236
#> 6 0.164 -0.204
Now we can plot the regression line and the original data just like any other plot.
ggplot() +
geom_point(data = face, mapping = aes(x = d.comp, y = diff.share)) +
geom_line(data = grid, mapping = aes(x = d.comp, y = pred),
colour = "red")
This method is more complicated than the geom_abline
method for a bivariate regression, but will work for more complicated models, while the geom_abline
method won’t.
Note that geom_smooth can be used to add a regression line to a plot.
ggplot(data = face, mapping = aes(x = d.comp, y = diff.share)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
The argument method = "lm"
specifies that the function lm
is to be used to generate fitted values. It is equivalent to running the regression lm(y ~ x)
and plotting the regression line, where y
and x
are the aesthetics specified by the mappings. The argument se = FALSE
tells the function not to plot the confidence interval of the regression (discussed later).
4.2.4 Regression towards the mean
4.2.5 Merging Data Sets in R
See the R for Data Science chapter Relational data.
data("pres12", package = "qss")
To join the two data frames:
full_join(pres08, pres12, by = "state")
However, since there are duplicate names, .x
and .y
are appended.
Challenge: What would happen if by = "state"
was dropped?
To avoid the duplicate names, or change them, you can rename before merging,
full_join(select(pres08, state, Obama_08 = Obama, McCain_08 = McCain,
EV_08 = EV),
select(pres12, state, Obama_12 = Obama, Romney_12 = Romney,
EV_12 = EV),
by = "state")
or use the suffix
argument to full_join
pres <- full_join(pres08, pres12, by = "state", suffix = c("_08", "_12"))
head(pres)
#> state.name state Obama_08 McCain EV_08 margin Obama_12 Romney EV_12
#> 1 Alabama AL 39 60 9 -21 38 61 9
#> 2 Alaska AK 38 59 3 -21 41 55 3
#> 3 Arizona AZ 45 54 10 -9 45 54 11
#> 4 Arkansas AR 39 59 6 -20 37 61 6
#> 5 California CA 61 37 55 24 60 37 55
#> 6 Colorado CO 54 45 9 9 51 46 9
Challenge: Would you consider this data tidy? How would you make it tidy?
The dplyr equivalent of cbind is bind_cols, which concatenates data frames by column.
pres <- pres %>%
mutate(Obama2008_z = as.numeric(scale(Obama_08)),
Obama2012_z = as.numeric(scale(Obama_12)))
Likewise, the dplyr equivalent of rbind is bind_rows, which concatenates data frames by row.
We need to use the as.numeric
function because scale()
always returns a matrix. Omitting as.numeric()
would not produce an error in the code chunk above, since the columns of a data frame can be matrices, but it would produce errors in some of the following code if it were omitted.
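As a small illustration of the binding functions mentioned above (the data frames here are made up):
df_a <- tibble(x = 1:2)
df_b <- tibble(y = 3:4)
bind_cols(df_a, df_b)  # two rows, columns x and y, like cbind()
bind_rows(df_a, df_a)  # four rows, column x, like rbind()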
Scatter plot of states with vote shares in 2008 and 2012
ggplot(pres, aes(x = Obama2008_z, y = Obama2012_z, label = state)) +
geom_abline(colour = "white", size = 2) +
geom_text() +
coord_fixed() +
scale_x_continuous("Obama's standardized vote share in 2008",
limits = c(-4, 4)) +
scale_y_continuous("Obama's standardized vote share in 2012",
limits = c(-4, 4))
To calculate the bottom and top quartiles
pres %>%
filter(Obama2008_z < quantile(Obama2008_z, 0.25)) %>%
summarise(improve = mean(Obama2012_z > Obama2008_z))
#> improve
#> 1 0.583
pres %>%
filter(Obama2008_z > quantile(Obama2008_z, 0.75)) %>%
summarise(improve = mean(Obama2012_z > Obama2008_z))
Challenge: Why is it important to standardize the vote shares?
4.2.6 Model Fit
data("florida", package = "qss")
fit2 <- lm(Buchanan00 ~ Perot96, data = florida)
fit2
#>
#> Call:
#> lm(formula = Buchanan00 ~ Perot96, data = florida)
#>
#> Coefficients:
#> (Intercept) Perot96
#> 1.3458 0.0359
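One summary of model fit is the coefficient of determination, \(R^2\), the proportion of the total variation in the outcome that the model explains: \[ R^2 = 1 - \frac{\sum_{i = 1}^{n} \hat{\epsilon}_i^2}{\sum_{i = 1}^{n} (Y_i - \bar{Y})^2} . \]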
Extract \(R ^ 2\) from the results of summary
,
summary(fit2)$r.squared
#> [1] 0.513
Alternatively, we can get the \(R^2\) value from the data frame that glance returns:
glance(fit2)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> 1 0.513 0.506 316 68.5 9.47e-12 2 -480 966 972
#> deviance df.residual
#> 1 6506118 65
We can add predictions and residuals to the original data frame using the modelr functions add_residuals and add_predictions
florida <-
florida %>%
add_predictions(fit2) %>%
add_residuals(fit2)
glimpse(florida)
#> Observations: 67
#> Variables: 9
#> $ county <chr> "Alachua", "Baker", "Bay", "Bradford", "Brevard", "...
#> $ Clinton96 <int> 40144, 2273, 17020, 3356, 80416, 320736, 1794, 2712...
#> $ Dole96 <int> 25303, 3684, 28290, 4038, 87980, 142834, 1717, 2783...
#> $ Perot96 <int> 8072, 667, 5922, 819, 25249, 38964, 630, 7783, 7244...
#> $ Bush00 <int> 34124, 5610, 38637, 5414, 115185, 177323, 2873, 354...
#> $ Gore00 <int> 47365, 2392, 18850, 3075, 97318, 386561, 2155, 2964...
#> $ Buchanan00 <int> 263, 73, 248, 65, 570, 788, 90, 182, 270, 186, 122,...
#> $ pred <dbl> 291.3, 25.3, 214.0, 30.8, 908.2, 1400.7, 24.0, 280....
#> $ resid <dbl> -2.83e+01, 4.77e+01, 3.40e+01, 3.42e+01, -3.38e+02,...
There are now two new columns in florida
, pred
with the fitted values (predictions), and resid
with the residuals.
Use these new columns to create a residual plot:
fit2_resid_plot <-
ggplot(florida, aes(x = pred, y = resid)) +
geom_ref_line(h = 0) +
geom_point() +
labs(x = "Fitted values", y = "residuals")
fit2_resid_plot
Note that we use the function geom_ref_line to add a reference line at 0.
Let's add labels to the points; who is that outlier?
fit2_resid_plot +
geom_label(aes(label = county))
The outlier county is “Palm Beach”
florida %>%
arrange(desc(abs(resid))) %>%
select(county, resid) %>%
head()
#> county resid
#> 1 PalmBeach 2302
#> 2 Broward -613
#> 3 Lee -357
#> 4 Brevard -338
#> 5 Miami-Dade -329
#> 6 Pinellas -317
Data without Palm Beach
florida_pb <- filter(florida, county != "PalmBeach")
fit3 <- lm(Buchanan00 ~ Perot96, data = florida_pb)
fit3
#>
#> Call:
#> lm(formula = Buchanan00 ~ Perot96, data = florida_pb)
#>
#> Coefficients:
#> (Intercept) Perot96
#> 45.8419 0.0244
The \(R^2\), or coefficient of determination, is much higher without Palm Beach:
glance(fit3)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> 1 0.851 0.849 87.7 366 3.61e-28 2 -388 782 788
#> deviance df.residual
#> 1 492803 64
florida_pb %>%
add_residuals(fit3) %>%
add_predictions(fit3) %>%
ggplot(aes(x = pred, y = resid)) +
geom_ref_line(h = 0) +
geom_point() +
ylim(-750, 2500) +
xlim(0, 1500) +
labs(x = "Fitted values", y = "residuals")
Create predictions for both models using data_grid and gather_predictions:
florida_grid <-
florida %>%
data_grid(Perot96) %>%
gather_predictions(fit2, fit3) %>%
mutate(model =
fct_recode(model,
"Regression\n with Palm Beach" = "fit2",
"Regression\n without Palm Beach" = "fit3"))
Note that this is an example of using non-syntactic names, as discussed in Chapter 10 of R for Data Science.
ggplot() +
geom_point(data = florida, mapping = aes(x = Perot96, y = Buchanan00)) +
geom_line(data = florida_grid,
mapping = aes(x = Perot96, y = pred,
colour = model)) +
geom_label(data = filter(florida, county == "PalmBeach"),
mapping = aes(x = Perot96, y = Buchanan00, label = county),
vjust = "top", hjust = "right") +
geom_text(data = tibble(label = unique(florida_grid$model),
x = c(20000, 31000),
y = c(1000, 300)),
mapping = aes(x = x, y = y, label = label, colour = label)) +
labs(x = "Perot's Vote in 1996", y = "Buchanan's Votes in 1996") +
theme(legend.position = "none")
See Graphics for communication in R for Data Science on labels and annotations in plots.
4.3 Regression and Causation
4.3.1 Randomized Experiments
Load data
data("women", package = "qss")
Proportion of female politicians in reserved GP vs. unreserved GP
women %>%
group_by(reserved) %>%
summarise(prop_female = mean(female))
#> # A tibble: 2 x 2
#> reserved prop_female
#> <int> <dbl>
#> 1 0 0.0748
#> 2 1 1.00
The difference-in-means estimator:
# difference in means for new irrigation facilities
mean(women$irrigation[women$reserved == 1]) -
mean(women$irrigation[women$reserved == 0])
#> [1] -0.369
Mean values of irrigation
and water
in reserved and non-reserved districts.
women %>%
group_by(reserved) %>%
summarise(irrigation = mean(irrigation),
water = mean(water))
#> # A tibble: 2 x 3
#> reserved irrigation water
#> <int> <dbl> <dbl>
#> 1 0 3.39 14.7
#> 2 1 3.02 24.0
The difference between the two groups can be calculated with the function diff, which calculates the difference between subsequent observations. This works as long as we are careful about which group is first or second.
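For example:
diff(c(2, 5, 9))
#> [1] 3 4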
women %>%
group_by(reserved) %>%
summarise(irrigation = mean(irrigation),
water = mean(water)) %>%
summarise(diff_irrigation = diff(irrigation),
diff_water = diff(water))
#> # A tibble: 1 x 2
#> diff_irrigation diff_water
#> <dbl> <dbl>
#> 1 -0.369 9.25
The other way uses tidyr spread and gather,
women %>%
group_by(reserved) %>%
summarise(irrigation = mean(irrigation),
water = mean(water)) %>%
gather(variable, value, -reserved) %>%
spread(reserved, value) %>%
mutate(diff = `1` - `0`)
#> # A tibble: 2 x 4
#> variable `0` `1` diff
#> <chr> <dbl> <dbl> <dbl>
#> 1 irrigation 3.39 3.02 -0.369
#> 2 water 14.7 24.0 9.25
Now each row is an outcome variable of interest, and there are columns for the treatment (1
) and control (0
) groups, and the difference (diff
).
lm(water ~ reserved, data = women)
#>
#> Call:
#> lm(formula = water ~ reserved, data = women)
#>
#> Coefficients:
#> (Intercept) reserved
#> 14.74 9.25
lm(irrigation ~ reserved, data = women)
#>
#> Call:
#> lm(formula = irrigation ~ reserved, data = women)
#>
#> Coefficients:
#> (Intercept) reserved
#> 3.388 -0.369
4.3.2 Regression with multiple predictors
data("social", package = "qss")
glimpse(social)
#> Observations: 305,866
#> Variables: 6
#> $ sex <chr> "male", "female", "male", "female", "female", "mal...
#> $ yearofbirth <int> 1941, 1947, 1951, 1950, 1982, 1981, 1959, 1956, 19...
#> $ primary2004 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,...
#> $ messages <chr> "Civic Duty", "Civic Duty", "Hawthorne", "Hawthorn...
#> $ primary2006 <int> 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,...
#> $ hhsize <int> 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 1, 2, 2, 1, 2, 2, 1,...
The messages variable is a character vector rather than a factor, so it has no levels:
levels(social$messages)
#> NULL
fit <- lm(primary2006 ~ messages, data = social)
fit
#>
#> Call:
#> lm(formula = primary2006 ~ messages, data = social)
#>
#> Coefficients:
#> (Intercept) messagesControl messagesHawthorne
#> 0.31454 -0.01790 0.00784
#> messagesNeighbors
#> 0.06341
Create indicator variables for each message:
social <-
social %>%
mutate(Control = as.integer(messages == "Control"),
Hawthorne = as.integer(messages == "Hawthorne"),
Neighbors = as.integer(messages == "Neighbors"))
Alternatively, create these using a for loop, which is easier to understand and less prone to typos:
for (i in unique(social$messages)) {
social[[i]] <- as.integer(social[["messages"]] == i)
}
We created a variable for each level of messages
even though we will exclude one of them.
lm(primary2006 ~ Control + Hawthorne + Neighbors, data = social)
#>
#> Call:
#> lm(formula = primary2006 ~ Control + Hawthorne + Neighbors, data = social)
#>
#> Coefficients:
#> (Intercept) Control Hawthorne Neighbors
#> 0.31454 -0.01790 0.00784 0.06341
Create predictions for each unique value of messages
unique_messages <-
data_grid(social, messages) %>%
add_predictions(fit)
unique_messages
#> # A tibble: 4 x 2
#> messages pred
#> <chr> <dbl>
#> 1 Civic Duty 0.315
#> 2 Control 0.297
#> 3 Hawthorne 0.322
#> 4 Neighbors 0.378
Compare to the sample averages
social %>%
group_by(messages) %>%
summarise(mean(primary2006))
#> # A tibble: 4 x 2
#> messages `mean(primary2006)`
#> <chr> <dbl>
#> 1 Civic Duty 0.315
#> 2 Control 0.297
#> 3 Hawthorne 0.322
#> 4 Neighbors 0.378
Linear regression without intercept.
fit_noint <- lm(primary2006 ~ -1 + messages, data = social)
fit_noint
#>
#> Call:
#> lm(formula = primary2006 ~ -1 + messages, data = social)
#>
#> Coefficients:
#> messagesCivic Duty messagesControl messagesHawthorne
#> 0.315 0.297 0.322
#> messagesNeighbors
#> 0.378
Calculating the average effect of each message is also easier if we make the control group the first level, so that all regression coefficients are comparisons to it. Use fct_relevel to make "Control" the first level:
fit_control <-
mutate(social, messages = fct_relevel(messages, "Control")) %>%
lm(primary2006 ~ messages, data = .)
fit_control
#>
#> Call:
#> lm(formula = primary2006 ~ messages, data = .)
#>
#> Coefficients:
#> (Intercept) messagesCivic Duty messagesHawthorne
#> 0.2966 0.0179 0.0257
#> messagesNeighbors
#> 0.0813
Difference in means
social %>%
group_by(messages) %>%
summarise(primary2006 = mean(primary2006)) %>%
mutate(Control = primary2006[messages == "Control"],
diff = primary2006 - Control)
#> # A tibble: 4 x 4
#> messages primary2006 Control diff
#> <chr> <dbl> <dbl> <dbl>
#> 1 Civic Duty 0.315 0.297 0.0179
#> 2 Control 0.297 0.297 0
#> 3 Hawthorne 0.322 0.297 0.0257
#> 4 Neighbors 0.378 0.297 0.0813
Adjusted R-squared is included in the output of broom::glance()
glance(fit)
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC
#> 1 0.00328 0.00327 0.463 336 1.06e-217 4 -198247 396504
#> BIC deviance df.residual
#> 1 396557 65468 305862
glance(fit)[["adj.r.squared"]]
#> [1] 0.00327
4.3.3 Heterogeneous Treatment Effects
Average treatment effect (ATE) of the Neighbors message among those who did and did not vote in the 2004 primary:
ate <-
social %>%
group_by(primary2004, messages) %>%
summarise(primary2006 = mean(primary2006)) %>%
spread(messages, primary2006) %>%
mutate(ate_Neighbors = Neighbors - Control) %>%
select(primary2004, Neighbors, Control, ate_Neighbors)
ate
#> # A tibble: 2 x 4
#> # Groups: primary2004 [2]
#> primary2004 Neighbors Control ate_Neighbors
#> <int> <dbl> <dbl> <dbl>
#> 1 0 0.306 0.237 0.0693
#> 2 1 0.482 0.386 0.0965
Difference in the ATE between 2004 voters and non-voters:
diff(ate$ate_Neighbors)
#> [1] 0.0272
social_neighbor <- social %>%
filter( (messages == "Control") | (messages == "Neighbors"))
fit_int <- lm(primary2006 ~ primary2004 + messages + primary2004:messages,
data = social_neighbor)
fit_int
#>
#> Call:
#> lm(formula = primary2006 ~ primary2004 + messages + primary2004:messages,
#> data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) primary2004
#> 0.2371 0.1487
#> messagesNeighbors primary2004:messagesNeighbors
#> 0.0693 0.0272
lm(primary2006 ~ primary2004 * messages, data = social_neighbor)
#>
#> Call:
#> lm(formula = primary2006 ~ primary2004 * messages, data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) primary2004
#> 0.2371 0.1487
#> messagesNeighbors primary2004:messagesNeighbors
#> 0.0693 0.0272
social_neighbor <-
social_neighbor %>%
mutate(age = 2008 - yearofbirth)
summary(social_neighbor$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 22.0 43.0 52.0 51.8 61.0 108.0
fit.age <- lm(primary2006 ~ age * messages, data = social_neighbor)
fit.age
#>
#> Call:
#> lm(formula = primary2006 ~ age * messages, data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) age messagesNeighbors
#> 0.089477 0.003998 0.048573
#> age:messagesNeighbors
#> 0.000628
Calculate average treatment effects
ate.age <-
crossing(age = seq(from = 25, to = 85, by = 20),
messages = c("Neighbors", "Control")) %>%
add_predictions(fit.age) %>%
spread(messages, pred) %>%
mutate(diff = Neighbors - Control)
ate.age
#> # A tibble: 4 x 4
#> age Control Neighbors diff
#> <dbl> <dbl> <dbl> <dbl>
#> 1 25 0.189 0.254 0.0643
#> 2 45 0.269 0.346 0.0768
#> 3 65 0.349 0.439 0.0894
#> 4 85 0.429 0.531 0.102
You can use the poly function to calculate polynomials instead of adding each term, as in age + I(age ^ 2). Note that the coefficients will be different, since by default poly calculates orthogonal polynomials rather than the natural (raw) polynomials. However, you shouldn't interpret the coefficients directly anyway, so this shouldn't matter (a quick check that the fitted values are unchanged follows the model output below).
fit.age2 <- lm(primary2006 ~ poly(age, 2) * messages,
data = social_neighbor)
fit.age2
#>
#> Call:
#> lm(formula = primary2006 ~ poly(age, 2) * messages, data = social_neighbor)
#>
#> Coefficients:
#> (Intercept) poly(age, 2)1
#> 0.2966 27.6665
#> poly(age, 2)2 messagesNeighbors
#> -10.2832 0.0816
#> poly(age, 2)1:messagesNeighbors poly(age, 2)2:messagesNeighbors
#> 4.5820 -5.5124
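As a quick check (not in the text) that raw polynomials change only the coefficients and not the fit, compare the fitted values from a model using poly(age, 2, raw = TRUE):
fit.age2_raw <- lm(primary2006 ~ poly(age, 2, raw = TRUE) * messages,
data = social_neighbor)
all.equal(fitted(fit.age2), fitted(fit.age2_raw))
#> [1] TRUE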
Create a data frame of combinations of ages and messages using data_grid, which means that we only need to specify the variables, and not the specific values,
y.hat <-
data_grid(social_neighbor, age, messages) %>%
add_predictions(fit.age2)
ggplot(y.hat, aes(x = age, y = pred,
colour = str_c(messages, " condition"))) +
geom_line() +
labs(colour = "", y = "Predicted turnout rates") +
theme(legend.position = "bottom")
y.hat %>%
spread(messages, pred) %>%
mutate(ate = Neighbors - Control) %>%
filter(age > 20, age < 90) %>%
ggplot(aes(x = age, y = ate)) +
geom_line() +
labs(y = "Estimated average treatment effect",
x = "Age") +
ylim(0, 0.1)
4.3.4 Regression Discontinuity Design
data("MPs", package = "qss")
MPs_labour <- filter(MPs, party == "labour")
MPs_tory <- filter(MPs, party == "tory")
labour_fit1 <- lm(ln.net ~ margin,
data = filter(MPs_labour, margin < 0))
labour_fit2 <- lm(ln.net ~ margin, data = filter(MPs_labour, margin > 0))
tory_fit1 <- lm(ln.net ~ margin,
data = filter(MPs_tory, margin < 0))
tory_fit2 <- lm(ln.net ~ margin, data = filter(MPs_tory, margin > 0))
Use data_grid and add_predictions to generate a grid of predictions for each regression:
y1_labour <-
filter(MPs_labour, margin < 0) %>%
data_grid(margin) %>%
add_predictions(labour_fit1)
y2_labour <-
filter(MPs_labour, margin > 0) %>%
data_grid(margin) %>%
add_predictions(labour_fit2)
y1_tory <-
filter(MPs_tory, margin < 0) %>%
data_grid(margin) %>%
add_predictions(tory_fit1)
y2_tory <-
filter(MPs_tory, margin > 0) %>%
data_grid(margin) %>%
add_predictions(tory_fit2)
Tory politicians
ggplot() +
geom_ref_line(v = 0) +
geom_point(data = MPs_tory,
mapping = aes(x = margin, y = ln.net)) +
geom_line(data = y1_tory,
mapping = aes(x = margin, y = pred), colour = "red", size = 1.5) +
geom_line(data = y2_tory,
mapping = aes(x = margin, y = pred), colour = "red", size = 1.5) +
labs(x = "margin of victory", y = "log net wealth at death",
title = "labour")
We can actually produce this plot easily without running the regressions, by using geom_smooth
:
ggplot(mutate(MPs, winner = (margin > 0)),
aes(x = margin, y = ln.net)) +
geom_ref_line(v = 0) +
geom_point() +
geom_smooth(method = lm, se = FALSE, mapping = aes(group = winner)) +
facet_grid(party ~ .) +
labs(x = "margin of victory", y = "log net wealth at death")
In the previous code, I didn't directly compute the average net wealth at a margin of 0, so I'll do that here. I'll use spread_predictions to add predictions from multiple models:
spread_predictions(data_frame(margin = 0),
tory_fit1, tory_fit2) %>%
mutate(rd_est = tory_fit2 - tory_fit1)
#> # A tibble: 1 x 4
#> margin tory_fit1 tory_fit2 rd_est
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 12.5 13.2 0.650
tory_fit3 <- lm(margin.pre ~ margin, data = filter(MPs_tory, margin < 0))
tory_fit4 <- lm(margin.pre ~ margin, data = filter(MPs_tory, margin > 0))
(filter(tidy(tory_fit4), term == "(Intercept)")[["estimate"]] -
filter(tidy(tory_fit3), term == "(Intercept)")[["estimate"]])
#> [1] -0.0173