\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \newcommand{\Mat}[1]{\boldsymbol{#1}} \newcommand{\Vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} \newcommand{\distr}[1]{\mathcal{#1}} \newcommand{\dnorm}{\distr{N}} \newcommand{\dmvnorm}[1]{\distr{N}_{#1}} \newcommand{\dt}[1]{\distr{T}_{#1}} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \]

Chapter 7 Cross-Validation

7.1 Example: Predicting Bordeaux Wine

The bordeaux dataset contains the prices of vintages of Bordeaux wines sold at auction in New York, as well as the weather and rainfall in the years of those vintages. Ashenfelter, Ashmore, and Lalonde (1995) used these data to show that the quality of a wine vintage, as measured by its price, can largely be predicted by a small number of variables. At the time, these predictions were not taken kindly by wine connoisseurs.[^wine]

The data contains 27 prices of Bordeaux wines sold at auction in New York in 1993, for vintages from 1952 to 1980; the 1954 and 1956 vintages were excluded because they were rarely sold. Prices are measured as an index where 100 is the price of the 1961 vintage.

The dataset also includes four predictors of the price of each vintage:

  • time_sv: Time since the vintage, where 0 is 1983.
  • wrain: Winter (October–March) rain
  • hrain: Harvest (August–September) rain
  • degrees: Average temperature in degrees centigrade from April to September in the vintage year.

The first relationship to consider is between the age of the vintage and its price, plotted below:
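
A minimal sketch of such a plot, assuming the data frame is loaded as bordeaux with the columns described above:

```r
library(tidyverse)

# Scatterplot of log price against years since vintage (a sketch)
ggplot(bordeaux, aes(x = time_sv, y = lprice)) +
  geom_point() +
  labs(x = "Years since vintage (time_sv)", y = "Log price (lprice)")
```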

Ashenfelter, Ashmore, and Lalonde (1995) ran two models. Both were estimated using OLS with log-price as the outcome variable. The predictors in the models were:

  1. vintage age
  2. vintage age, winter rain, harvest rain, and average temperature

We will start by considering these models. Since we are running several models, we'll define the model formulas in a list:
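
For example (a sketch; the object name formulas is illustrative):

```r
# Model formulas stored as character strings, one element per model
formulas <- c(
  "lprice ~ time_sv",
  "lprice ~ time_sv + wrain + hrain + degrees"
)
```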

Run each model and store the results in a data frame as a list column of lm objects:
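
One way to do this (a sketch, assuming the formulas vector above):

```r
models <- tibble(
  model = seq_along(formulas),
  formula = formulas
) %>%
  # lm() coerces the character string to a formula, so the printed
  # Call in the summaries below shows `formula = .x`
  mutate(mod = map(formula, ~ lm(.x, data = bordeaux)))
models
```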

## # A tibble: 2 x 3
##   model formula                                    mod     
##   <int> <chr>                                      <list>  
## 1     1 lprice ~ time_sv                           <S3: lm>
## 2     2 lprice ~ time_sv + wrain + hrain + degrees <S3: lm>

Now that we have these models, extract the coefficients into a data frame with the broom function tidy:
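
A sketch of that step, assuming the models tibble above; conf.int = TRUE adds the confidence-interval columns shown below:

```r
library(broom)

model_coefs <- models %>%
  mutate(coefs = map(mod, ~ tidy(.x, conf.int = TRUE))) %>%
  select(model, formula, coefs) %>%
  unnest(coefs)
glimpse(model_coefs)
```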

## Observations: 7
## Variables: 9
## $ model     <int> 1, 1, 2, 2, 2, 2, 2
## $ formula   <chr> "lprice ~ time_sv", "lprice ~ time_sv", "lprice ~ ti...
## $ term      <chr> "(Intercept)", "time_sv", "(Intercept)", "time_sv", ...
## $ estimate  <dbl> -2.025199170, 0.035429560, -12.145333577, 0.02384741...
## $ std.error <dbl> 0.2472286519, 0.0136624941, 1.6881026571, 0.00716671...
## $ statistic <dbl> -8.191604, 2.593199, -7.194665, 3.327521, 2.420525, ...
## $ p.value   <dbl> 1.522111e-08, 1.566635e-02, 3.278791e-07, 3.055739e-...
## $ conf.low  <dbl> -2.534376e+00, 7.291126e-03, -1.564624e+01, 8.984546...
## $ conf.high <dbl> -1.516022230, 0.063567993, -8.644422940, 0.038710280...
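
The standard summary() output for each fitted model can also be printed, for example:

```r
# Print the usual lm() summary for each model in the list column
walk(models$mod, ~ print(summary(.x)))
```
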
## 
## Call:
## lm(formula = .x, data = bordeaux)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8545 -0.4788 -0.0718  0.4562  1.2457 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.02520    0.24723  -8.192 1.52e-08 ***
## time_sv      0.03543    0.01366   2.593   0.0157 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5745 on 25 degrees of freedom
## Multiple R-squared:  0.212,  Adjusted R-squared:  0.1804 
## F-statistic: 6.725 on 1 and 25 DF,  p-value: 0.01567
## 
## 
## Call:
## lm(formula = .x, data = bordeaux)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.46027 -0.23864  0.01347  0.18600  0.53446 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.215e+01  1.688e+00  -7.195 3.28e-07 ***
## time_sv      2.385e-02  7.167e-03   3.328  0.00306 ** 
## wrain        1.167e-03  4.820e-04   2.421  0.02420 *  
## hrain       -3.861e-03  8.075e-04  -4.781 8.97e-05 ***
## degrees      6.164e-01  9.518e-02   6.476 1.63e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2865 on 22 degrees of freedom
## Multiple R-squared:  0.8275, Adjusted R-squared:  0.7962 
## F-statistic: 26.39 on 4 and 22 DF,  p-value: 4.058e-08

Likewise, extract model statistics such as \(R^2\), adjusted \(R^2\), and \(\hat{\sigma}\):
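
For example, using glance() from broom (a sketch):

```r
models %>%
  mutate(stats = map(mod, glance)) %>%
  select(formula, stats) %>%
  unnest(stats) %>%
  select(formula, r.squared, adj.r.squared, sigma)
```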

## # A tibble: 2 x 4
##   formula                                    r.squared adj.r.squared sigma
##   <chr>                                          <dbl>         <dbl> <dbl>
## 1 lprice ~ time_sv                               0.212         0.180 0.574
## 2 lprice ~ time_sv + wrain + hrain + degrees     0.828         0.796 0.287

7.2 Cross-Validation

The higher the degree of the polynomial fit to these data, the more wiggly the fitted line becomes.
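
A sketch of this, overlaying polynomial fits of increasing degree in vintage age (the degrees chosen here are illustrative):

```r
# Low- and high-degree polynomial fits of log price on vintage age
ggplot(bordeaux, aes(x = time_sv, y = lprice)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 6), se = FALSE,
              colour = "red")
```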

Intuitively, it seems that as we increase the flexibility of the model by adding variables, the model starts to overfit the data, but what does it actually mean to overfit? If we use \(R^2\) as the measure of fit, adding variables always leads to a better fit. Adjusted \(R^2\) does not necessarily increase, because the decrease in the residual sum of squares is offset by the loss of residual degrees of freedom. However, there is little justification for the specific penalty used in the adjusted \(R^2\) formula.

The problem with overfitting is that the model starts to fit peculiarities of the sample (the errors) rather than the true model. Since we never know the true model, all we can check is whether the model predicts new data well.

7.3 Out-of-Sample Error

To compare predictive models, we want to compare how well they predict (duh). This means estimating how well each model will work on new data. The problem with this is that new data is just that, …, new.

The trick is to reuse the sample data to get an estimate of how well the model will work on new data. This is done by fitting the model on a subset of the data, and predicting another subset of the data which was not used to fit the model; often this is done repeatedly. There are a variety of ways to do this, depending on the nature of the data and the predictive task. However, they all implicitly assume that the sample of data that was used to fit the model is representative of future data.

7.3.1 Held-out data

A common rule of thumb is to use 70% of the data for training, and 30% of the data for testing.

In this case, let’s partition the data to use the first 70% of the observations as training data, and the remaining 30% of the data as testing.
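
A minimal sketch of that split and fit (the object names are illustrative):

```r
# Use the first 70% of observations for training
n_train <- round(0.7 * nrow(bordeaux))

mod_train <- lm(lprice ~ time_sv + wrain + hrain + degrees,
                data = head(bordeaux, n_train))
mod_train

# In-sample (training) RMSE
sqrt(mean(residuals(mod_train)^2))
```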

## 
## Call:
## lm(formula = lprice ~ time_sv + wrain + hrain + degrees, data = head(bordeaux, 
##     n_train))
## 
## Coefficients:
## (Intercept)      time_sv        wrain        hrain      degrees  
##  -1.080e+01    1.999e-02    9.712e-04   -4.461e-03    5.533e-01
## [1] 0.2280059

The out-of-sample RMSE is higher than the in-sample RMSE.
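
For example (a sketch, continuing from the training fit above):

```r
# Out-of-sample (test) RMSE on the remaining 30% of observations
test <- tail(bordeaux, nrow(bordeaux) - n_train)
sqrt(mean((test$lprice - predict(mod_train, newdata = test))^2))
```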

## [1] 0.351573

This is common, but not necessarily the case. Note also that this value is highly dependent on the particular subset of data used for testing; in some sense, we may end up choosing a model that “overfits” the testing data.

7.3.2 \(k\)-fold Cross-validation

A more robust approach is to repeat this training/testing split multiple times.

The most common approach is to partition the data into \(k\) folds and use each fold once as the testing subset, with the model fit on the other \(k - 1\) folds.

A common rule of thumb is to use 5 to 10 folds.
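
A sketch of 5-fold cross-validation for the four-variable model, using crossv_kfold() from the modelr package (the seed and object names are illustrative):

```r
library(modelr)

set.seed(2015)  # arbitrary seed, for reproducible fold assignment
folds <- crossv_kfold(bordeaux, k = 5)

cv <- folds %>%
  mutate(mod = map(train, ~ lm(lprice ~ time_sv + wrain + hrain + degrees,
                               data = .x)),
         rmse_fold = map2_dbl(mod, test, modelr::rmse))

# Average the per-fold test RMSEs as the cross-validation estimate
mean(cv$rmse_fold)
```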

7.3.3 Leave-One-Out Cross-Validation

Leave-one-out cross-validation (LOO-CV) is \(k\)-fold cross-validation with the number of folds equal to the number of observations. The model is fit \(n\) times, each time training on \(n - 1\) observations and predicting the single observation that was left out.
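
A sketch of LOO-CV by brute force, refitting the model \(n\) times:

```r
n <- nrow(bordeaux)

# Squared prediction error for each left-out observation
loo_sq_err <- map_dbl(seq_len(n), function(i) {
  fit_i <- lm(lprice ~ time_sv + wrain + hrain + degrees,
              data = bordeaux[-i, ])
  (bordeaux$lprice[i] - predict(fit_i, newdata = bordeaux[i, ]))^2
})

# LOO-CV estimates of the MSE and RMSE
mean(loo_sq_err)
sqrt(mean(loo_sq_err))
```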

7.4 Approximations

For some models, notably linear regression, analytical approximations to the expected out-of-sample error can be made. Each approximation makes slightly different assumptions in order to plug in unknown quantities.

In linear regression, the LOO-CV MSE can be calculated analytically, without actually refitting the model \(n\) times. It is (James et al. 2013, 180): \[ \text{LOO-CV} = \frac{1}{n} \sum_{i = 1}^n {\left(\frac{y_i - \hat{y}_i}{1 - h_i} \right)}^2 = \frac{1}{n} \sum_{i = 1}^n {\left(\frac{\hat{\epsilon}_i}{1 - h_i} \right)}^2 = \frac{1}{n} \times \text{PRESS} \] where PRESS is the predicted residual sum of squares and \(h_i\) is the hat-value of observation \(i\), the \(i\)-th diagonal element of the hat matrix \(\Mat{H} = \Mat{X}(\Mat{X}' \Mat{X})^{-1} \Mat{X}'\) (Fox 2016, 270, 289), \[ h_i = \Vec{x}_i' (\Mat{X}' \Mat{X})^{-1} \Vec{x}_i . \]
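
In R, this can be computed from a single fit using hatvalues() (a sketch for the four-variable model):

```r
fit <- lm(lprice ~ time_sv + wrain + hrain + degrees, data = bordeaux)
h <- hatvalues(fit)

# LOO-CV MSE computed from a single fit: PRESS / n
mean((residuals(fit) / (1 - h))^2)
```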

An alternative approximation of the expected out-of-sample error is the generalized cross-validation criterion (GCV) (Fox 2016), \[ \text{GCV} = \frac{n \times \text{RSS}}{\mathit{df}_{res}^2} = \frac{n \sum_i \hat{\epsilon}_i^2}{(n - k - 1)^2} , \] where \(k\) is the number of predictors.
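
For example (a sketch, reusing the fit object from above):

```r
# GCV for the four-variable model
n <- nobs(fit)
rss <- sum(residuals(fit)^2)
n * rss / df.residual(fit)^2
```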

Other measures, such as the AIC and BIC, can also be understood as estimates of some form of out-of-sample error.

References

Ashenfelter, Orley, David Ashmore, and Robert Lalonde. 1995. “Bordeaux Wine Vintage Quality and Weather.” Chance 8 (4). https://doi.org/10.1080/09332480.1995.10542468.

Fox, John. 2016. Applied Regression Analysis & Generalized Linear Models. 3rd ed. Sage.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer.