I need your help!

If you find any typos, errors, or places where the text may be improved, please let me know. The best ways to provide feedback are by GitHub or hypothes.is annotations.

Opening an issue or submitting a pull request on GitHub

Hypothesis Adding an annotation using hypothes.is. To add an annotation, select some text and then click the on the pop-up menu. To see the annotations of others, click the in the upper right-hand corner of the page.

24 Model building

24.1 Introduction

The splines package is needed for the ns() function used in one of the solutions.

24.2 Why are low quality diamonds more expensive?

This code appears in the section and is necessary for the exercises.

Exercise 24.2.1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

The distribution of diamonds has more diamonds at round or otherwise human-friendly numbers (fractions).

Exercise 24.2.2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

Following the examples in the chapter, I use a base-2 logarithm.

The estimated relationship between carat and price looks like this.

The plot shows that the estimated relationship between carat and price is not linear. The exact relationship in this model is if \(x\) increases \(r\) times, then \(y\) increases \(r^{a_1}\) times. For example, a two times increase in carat is associated with the following increase in price:

Let’s confirm this relationship by checking it for a few values of the carat variable. Let’s increase carat from 1 to 2.

Note that, since predict() predicts log2(carat) rather than carat, the prediction is exponentiated by 2. Now let’s increase carat from 4 to 2.

Finally, let’s increase carat from 0.5 to 1.

All of these examples return the same value, \(2 ^ {a_1} = 3.2\).

So why is this? Let’s ignore the names of the variables in this case and consider the equation: \[ \log_b y = a_0 + a_1 \log x \] We want to understand how the difference in \(y\) is related to the difference in \(x\). Now, consider this equation at two different values \(x_1\) and \(x_0\), \[ \log_b y_0 = a_0 + \log_b x_0 \\ \log_b y_1 = a_0 + \log_b y_1 \] What is the value of the difference, \(\log y_1 - \log y_0\)? \[ \begin{aligned}[t] \log_b(y_1) - \log_b(y_0) &= (a_0 + a_1 \log_b x_1) - (a_0 + a_1 \log x_0) ,\\ &= a_1 (\log_b x_1 - \log x_0) , \\ \log_b \left(\frac{y_1}{y_0} \right) &= \log_b \left(\frac{x_1}{x_0} \right)^{a_1} , \\ \frac{y_1}{y_0} &= \left( \frac{x_1}{x_0} \right)^{a_1} . \end{aligned} \] Let \(s = y_1 / y_0\) and \(r = x_1 / x_0\). Then, \[ s = r^{a_1} \text{.} \] In other words, an \(r\) times increase in \(x\), is associated with a \(r^{a_1}\) times increase in \(y\). Note that this relationship does not depend on the base of the logarithm, \(b\).

There is another approximation that is commonly used when logarithms appear in regressions.

The first way to show this is using the approximation that \(x\) is small, meaning that \(x \approx 0\), \[ \log (1 + x) \approx x \] This approximation is the first order Taylor expansion of the function at \(x = 0\). Now consider the relationship between the percent change in \(x\) and the percent change in \(y\), \[ \begin{aligned}[t] \log (y + \Delta y) - \log y &= (\alpha + \beta \log (x + \Delta x)) - (\alpha + \beta \log x) \\ \log \left(\frac{y + \Delta y}{y} \right) &= \beta \log\left( \frac{x + \Delta x}{x} \right) \\ \log \left(1 + \frac{\Delta y}{y} \right) &= \beta \log\left( 1 + \frac{\Delta x}{x} \right) \\ \frac{\Delta y}{y} &\approx \beta \left(\frac{\Delta x}{x} \right) \end{aligned} \] Thus a 1% percentage change in \(x\) is associated with a \(\beta\) percent change in \(y\).

This relationship can also be derived by taking the derivative of \(\log y\) with respect to \(x\). First, rewrite the equation in terms of \(y\), \[ y = \exp(a_0 + a_1 \log(x)) \] Then differentiate \(y\) with respect to \(x\), \[ \begin{aligned}[t] dy &= \exp(a_0 + a_1 \log x) \left(\frac{a_1}{x}\right) dx \\ &= a_1 y \left(\frac{dx}{x} \right) \\ (dy / y) &= a_1 (dx / x) \\ \%\Delta y &= a_1\%\Delta x \end{aligned} \]

Exercise 24.2.4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond?

Section 24.2.2 already provides part of the answer to this question.

Plotting the residuals of the model shows that there are some large outliers for small carat sizes. The largest of these residuals are a little over two, which means that the true value was four times lower; see Exercise 24.2.2. Most of the mass of the residuals is between -0.5 and 0.5, which corresponds to about \(\pm 40%\). There seems to be a slight downward bias in the residuals as carat size increases.

While in some cases the model can be wrong, overall the model seems to perform well. The root mean squared error is 0.19 meaning that the average error is about -14%. Another summary statistics of errors is the mean absolute error (MAE), which is the mean of the absolute values of the errors. The MAE is 0.15, which is -11%. Finally, 95% of the residuals are between -0.37 and 0.38, which correspond to 23—31.

Whether you think that this is a good model depends on factors outside the statistical model itself. It will depend on the how the model is being used. I have no idea how to price diamonds, so this would be useful to me in order to understand a reasonable price range for a diamond, so I don’t get ripped off. However, if I were buying and selling diamonds as a business, I would probably require a better model.

24.3 What affects the number of daily flights?

This code is copied from the book and needed for the exercises.

Exercise 24.3.1

Use your Google sleuthing skills to brainstorm why there were fewer than expected flights on Jan 20, May 26, and Sep 1. (Hint: they all have the same explanation.) How would these days generalize to another year?

These are the Sundays before Monday holidays Martin Luther King Jr. Day, Memorial Day, and Labor Day. For other years, use the dates of the holidays for those years—the third Monday of January for Martin Luther King Jr. Day, the last Monday of May for Memorial Day, and the first Monday in September for Labor Day.

Exercise 24.3.2

What do the three days with high positive residuals represent? How would these days generalize to another year?

The top three days correspond to the Saturday after Thanksgiving (November 30th), the Sunday after Thanksgiving (December 1st), and the Saturday after Christmas (December 28th).

We could generalize these to other years using the dates of those holidays on those years.

Exercise 24.3.3

Create a new variable that splits the wday variable into terms, but only for Saturdays, i.e., it should have Thurs, Fri, but Sat-summer, Sat-spring, Sat-fall How does this model compare with the model with every combination of wday and term?

I’ll use the function case_when() to do this, though there are other ways which it could be solved.

I think the overlapping plot is hard to understand. If we are interested in the differences, it is better to plot the differences directly. In this code, I use spread_residuals() to add one column per model, rather than gather_residuals() which creates a new row for each model.

The model with terms × Saturday has higher residuals in the fall and lower residuals in the spring than the model with all interactions.

Comparing models, mod3 has a lower \(R^2\) and regression standard error, \(\hat{\sigma}\), despite using fewer variables. More importantly for prediction purposes, this model has a higher AIC, which is an estimate of the out of sample error.

Exercise 24.3.4

Create a new wday variable that combines the day of week, term (for Saturdays), and public holidays. What do the residuals of that model look like?

The question is unclear how to handle public holidays. There are several questions to consider.

First, what are the public holidays? I include all federal holidays in the United States in 2013. Other holidays to consider would be Easter and Good Friday which is US stock market holiday and widely celebrated religious holiday, Mothers Day, Fathers Day, and Patriots’ Day, which is a holiday in several states, and other state holidays.

The model could include a single dummy variable which indicates a day was a public holiday. Alternatively, I could include a dummy variable for each public holiday. I would expect that Veteran’s Day and Washington’s Birthday have a different effect on travel than Thanksgiving, Christmas, and New Year’s Day.

Another question is whether and how I should handle the days before and after holidays. Travel could be lighter on the day of the holiday, but heavier the day before or after.

Exercise 24.3.5

What happens if you fit a day of week effect that varies by month (i.e., n ~ wday * month)? Why is this not very helpful?

daily <- mutate(daily, month = factor(lubridate::month(date)))
mod6 <- lm(n ~ wday * month, data = daily)
print(summary(mod6))
#> 
#> Call:
#> lm(formula = n ~ wday * month, data = daily)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -269.2   -5.0    1.5    8.8  113.2 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     867.400      7.598  114.16  < 2e-16 ***
#> wday.L          -64.074     20.874   -3.07  0.00235 ** 
#> wday.Q         -165.600     20.156   -8.22  7.8e-15 ***
#> wday.C          -68.259     20.312   -3.36  0.00089 ***
#> wday^4          -92.081     20.499   -4.49  1.0e-05 ***
#> wday^5            9.793     19.733    0.50  0.62011    
#> wday^6          -20.438     18.992   -1.08  0.28280    
#> month2           23.707     10.995    2.16  0.03191 *  
#> month3           67.886     10.746    6.32  1.0e-09 ***
#> month4           74.593     10.829    6.89  3.7e-11 ***
#> month5           56.279     10.746    5.24  3.2e-07 ***
#> month6           80.307     10.829    7.42  1.4e-12 ***
#> month7           77.114     10.746    7.18  6.4e-12 ***
#> month8           81.636     10.746    7.60  4.5e-13 ***
#> month9           51.371     10.829    4.74  3.3e-06 ***
#> month10          60.136     10.746    5.60  5.2e-08 ***
#> month11          46.914     10.829    4.33  2.1e-05 ***
#> month12          38.779     10.746    3.61  0.00036 ***
#> wday.L:month2    -3.723     29.627   -0.13  0.90009    
#> wday.Q:month2    -3.819     29.125   -0.13  0.89578    
#> wday.C:month2     0.490     29.233    0.02  0.98664    
#> wday^4:month2     4.569     29.364    0.16  0.87646    
#> wday^5:month2    -4.255     28.835   -0.15  0.88278    
#> wday^6:month2    12.057     28.332    0.43  0.67076    
#> wday.L:month3   -14.571     28.430   -0.51  0.60870    
#> wday.Q:month3    15.439     28.207    0.55  0.58458    
#> wday.C:month3     8.226     28.467    0.29  0.77282    
#> wday^4:month3    22.720     28.702    0.79  0.42926    
#> wday^5:month3   -15.330     28.504   -0.54  0.59113    
#> wday^6:month3    11.373     28.268    0.40  0.68776    
#> wday.L:month4   -16.668     29.359   -0.57  0.57067    
#> wday.Q:month4    10.725     28.962    0.37  0.71142    
#> wday.C:month4    -0.245     28.725   -0.01  0.99320    
#> wday^4:month4    23.288     28.871    0.81  0.42056    
#> wday^5:month4   -17.872     28.076   -0.64  0.52494    
#> wday^6:month4     5.352     27.888    0.19  0.84794    
#> wday.L:month5     3.666     29.359    0.12  0.90071    
#> wday.Q:month5   -20.665     28.670   -0.72  0.47163    
#> wday.C:month5     4.634     28.725    0.16  0.87196    
#> wday^4:month5     5.999     28.511    0.21  0.83349    
#> wday^5:month5   -16.912     28.076   -0.60  0.54742    
#> wday^6:month5    12.764     27.194    0.47  0.63916    
#> wday.L:month6    -4.526     28.651   -0.16  0.87459    
#> wday.Q:month6    23.813     28.207    0.84  0.39927    
#> wday.C:month6    13.758     28.725    0.48  0.63234    
#> wday^4:month6    24.118     29.187    0.83  0.40932    
#> wday^5:month6   -17.648     28.798   -0.61  0.54048    
#> wday^6:month6    10.526     28.329    0.37  0.71051    
#> wday.L:month7   -28.791     29.359   -0.98  0.32760    
#> wday.Q:month7    49.585     28.670    1.73  0.08482 .  
#> wday.C:month7    54.501     28.725    1.90  0.05881 .  
#> wday^4:month7    50.847     28.511    1.78  0.07559 .  
#> wday^5:month7   -33.698     28.076   -1.20  0.23106    
#> wday^6:month7   -13.894     27.194   -0.51  0.60979    
#> wday.L:month8   -20.448     28.871   -0.71  0.47938    
#> wday.Q:month8     6.765     28.504    0.24  0.81258    
#> wday.C:month8     6.001     28.467    0.21  0.83319    
#> wday^4:month8    19.074     28.781    0.66  0.50806    
#> wday^5:month8   -19.312     28.058   -0.69  0.49183    
#> wday^6:month8     9.507     27.887    0.34  0.73341    
#> wday.L:month9   -30.341     28.926   -1.05  0.29511    
#> wday.Q:month9   -42.034     28.670   -1.47  0.14373    
#> wday.C:month9   -20.719     28.725   -0.72  0.47134    
#> wday^4:month9   -20.375     28.791   -0.71  0.47973    
#> wday^5:month9   -18.238     28.523   -0.64  0.52308    
#> wday^6:month9    11.726     28.270    0.41  0.67861    
#> wday.L:month10  -61.051     29.520   -2.07  0.03954 *  
#> wday.Q:month10  -26.235     28.504   -0.92  0.35815    
#> wday.C:month10  -32.435     28.725   -1.13  0.25979    
#> wday^4:month10  -12.212     28.990   -0.42  0.67389    
#> wday^5:month10  -27.686     27.907   -0.99  0.32201    
#> wday^6:month10    0.123     26.859    0.00  0.99634    
#> wday.L:month11  -54.947     28.926   -1.90  0.05851 .  
#> wday.Q:month11   16.012     28.670    0.56  0.57696    
#> wday.C:month11   54.950     28.725    1.91  0.05677 .  
#> wday^4:month11   47.286     28.791    1.64  0.10164    
#> wday^5:month11  -44.740     28.523   -1.57  0.11787    
#> wday^6:month11  -20.688     28.270   -0.73  0.46491    
#> wday.L:month12   -9.506     28.871   -0.33  0.74221    
#> wday.Q:month12   75.209     28.504    2.64  0.00879 ** 
#> wday.C:month12  -25.026     28.467   -0.88  0.38010    
#> wday^4:month12  -23.780     28.781   -0.83  0.40938    
#> wday^5:month12   20.447     28.058    0.73  0.46676    
#> wday^6:month12    9.586     27.887    0.34  0.73128    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 42 on 281 degrees of freedom
#> Multiple R-squared:  0.836,	Adjusted R-squared:  0.787 
#> F-statistic: 17.2 on 83 and 281 DF,  p-value: <2e-16

If we fit a day of week effect that varies by month, there will be 12 * 7 = 84 parameters in the model. Since each month has only four to five weeks, each of these day of week \(\times\) month effects is the average of only four or five observations. These estimates have large standard errors and likely not generalize well beyond the sample data, since they are estimated from only a few observations.

Exercise 24.3.6

What would you expect the model n ~ wday + ns(date, 5) to look like? Knowing what you know about the data, why would you expect it to be not particularly effective?

Previous models fit in the chapter and exercises show that the effects of days of the week vary across different times of the year. The model wday + ns(date, 5) does not interact the day of week effect (wday) with the time of year effects (ns(date, 5)).

I estimate a model which does not interact the day of week effects (mod7) with the spline to that which does (mod8). I need to load the splines package to use the ns() function.

The residuals of the model that does not interact day of week with time of year (mod7) are larger than those of the model that does (mod8). The model mod7 underestimates weekends during the summer and overestimates weekends during the autumn.

Exercise 24.3.7

We hypothesized that people leaving on Sundays are more likely to be business travelers who need to be somewhere on Monday. Explore that hypothesis by seeing how it breaks down based on distance and time: if it’s true, you’d expect to see more Sunday evening flights to places that are far away.

Comparing the average distances of flights by day of week, Sunday flights are the second longest. Saturday flights are the longest on average. Saturday may have the longest flights on average because there are fewer regularly scheduled short business/commuter flights on the weekends but that is speculation.

Hide outliers.

Try pointrange with mean and standard error of the mean (sd / sqrt(n)).

Try pointrange with mean and standard error of the mean (sd / sqrt(n)).

Exercise 24.3.8

It’s a little frustrating that Sunday and Saturday are on separate ends of the plot. Write a small function to set the levels of the factor so that the week starts on Monday.

See the chapter Factors for the function fct_relevel(). Use fct_relevel() to put all levels in-front of the first level (“Sunday”).

Now Monday is the first day of the week.

24.4 Learning more about models

No exercises