\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \newcommand{\Mat}[1]{\boldsymbol{#1}} \newcommand{\Vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} \newcommand{\distr}[1]{\mathcal{#1}} \newcommand{\dnorm}{\distr{N}} \newcommand{\dmvnorm}[1]{\distr{N}_{#1}} \newcommand{\dt}[1]{\distr{T}_{#1}} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \]

Chapter 5 Bootstrapping

The central analogy of bootstrapping is:

The population is to the sample as the sample is to the bootstrap samples (Fox 2008, 590).

To calculate the standard errors used in confidence intervals, we need to know the sampling distribution of the statistic of interest.

In the case of a mean, we can appeal to the central limit theorem if the sample size is large enough.

Bootstrapping takes a different approach. We use the sample as an estimate of the population. That is, the bootstrap claims \[ \text{sample distribution} \approx \text{population distribution} \] and then plugs in the sample distribution for the population distribution and draws new samples from it to generate a sampling distribution.

The bootstrap relies upon the plug-in principle. The plug-in principle is that when something is unknown, use an estimate of it. An example is the use of the sample standard deviation in place of the population standard deviation when calculating the standard error of the mean, \[ \se(\bar{x}) = \frac{\sigma}{\sqrt{n}} \approx \frac{\hat{\sigma}}{\sqrt{n}} \] The bootstrap is the plug-in principle on ’roids. It uses the empirical distribution as a plug-in for the unknown population distribution. See Figures 4 and 5 of Hesterberg (2015).
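As a small illustration of the plug-in step for the standard error of the mean, the sketch below substitutes the sample standard deviation for the unknown \(\sigma\). The function name `plug_in_se_mean` and the data are hypothetical and are not taken from the references.

```python
import math
import statistics


def plug_in_se_mean(sample):
    """Standard error of the mean with sigma replaced by its sample estimate."""
    sigma_hat = statistics.stdev(sample)  # sample sd, n - 1 denominator
    return sigma_hat / math.sqrt(len(sample))


# Hypothetical sample, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
print(plug_in_se_mean(sample))
```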

Bootstrap principles

  1. The substitution of the empirical distribution for the population works.
  2. Sample with replacement.
  • The bootstrap is for inference, not for better estimates. It can estimate uncertainty, but it cannot improve \(\bar{x}\). It does not generate new data out of nowhere. However, see the section on bagging for how bootstrap aggregation can be used to construct point estimates.

5.1 Non-parametric bootstrap

The non-parametric bootstrap resamples the data with replacement \(B\) times and calculates the statistic on each resample.
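A minimal sketch of this resampling loop, assuming the data fit in a list and the statistic is an ordinary function of a list. The name `bootstrap_replicates`, the seed argument, and the sample values are illustrative choices, not part of any package discussed in this chapter.

```python
import random
import statistics


def bootstrap_replicates(data, statistic, B=1000, seed=None):
    """Return B values of `statistic`, each computed on a resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(B):
        # Draw n observations with replacement from the original sample.
        resample = rng.choices(data, k=n)
        replicates.append(statistic(resample))
    return replicates


# Hypothetical sample, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
reps = bootstrap_replicates(sample, statistics.mean, B=2000, seed=1)
print(len(reps), min(reps), max(reps))
```

Passing the statistic as a function keeps the loop unchanged when the quantity of interest is something other than the mean.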

5.2 Standard Errors

The bootstrap is primarily a means to calculate standard errors.

Suppose there are \(r\) bootstrap replicates. Let \(\hat{\theta}^{*}_1, \dots, \hat{\theta}^{*}_r\) be the statistics calculated on each of the bootstrap samples. The bootstrap standard error is \[ \se^{*}\left(\hat{\theta}^{*}\right) = \sqrt{\frac{\sum_{b = 1}^r {(\hat{\theta}^{*}_b - \bar{\theta}^{*})}^2}{r - 1}} \] where \(\bar{\theta}^{*}\) is the mean of the bootstrap statistics, \[ \bar{\theta}^{*} = \frac{\sum_{b = 1}^r \hat{\theta}^{*}_b}{r} . \]
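The formula is the ordinary sample standard deviation applied to the replicates. The sketch below writes it out term by term; the function name `bootstrap_se` and the five replicate values are made up for illustration.

```python
import math


def bootstrap_se(replicates):
    """Standard deviation of the replicates, with an r - 1 denominator."""
    r = len(replicates)
    theta_bar = sum(replicates) / r  # mean of the bootstrap statistics
    sum_sq = sum((theta - theta_bar) ** 2 for theta in replicates)
    return math.sqrt(sum_sq / (r - 1))


# Hypothetical statistics from five bootstrap samples.
theta_star = [3.2, 3.9, 3.5, 3.6, 3.3]
print(bootstrap_se(theta_star))
```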

5.3 Confidence Intervals

There are multiple ways to calculate confidence intervals from the bootstrap replicates.

  • Normal-Theory Intervals
  • Percentile Intervals
  • ABC Intervals
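The first two kinds of interval can be sketched directly from a list of replicates. This is a rough illustration, not the implementation used by any package: the function names, the fixed \(z = 1.96\), and the quantile convention (floor for the lower index, ceiling for the upper index) are all assumptions. ABC intervals need more machinery and are not shown.

```python
import math
import random
import statistics


def normal_interval(estimate, replicates, z=1.96):
    """Normal-theory interval: estimate +/- z * bootstrap standard error."""
    se = statistics.stdev(replicates)
    return estimate - z * se, estimate + z * se


def percentile_interval(replicates, alpha=0.05):
    """Percentile interval: empirical alpha/2 and 1 - alpha/2 quantiles."""
    ordered = sorted(replicates)
    r = len(ordered)
    lower = ordered[math.floor((alpha / 2) * (r - 1))]
    upper = ordered[math.ceil((1 - alpha / 2) * (r - 1))]
    return lower, upper


# Hypothetical sample and replicates of its mean.
rng = random.Random(42)
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
replicates = [
    statistics.mean(rng.choices(sample, k=len(sample))) for _ in range(2000)
]
print(normal_interval(statistics.mean(sample), replicates))
print(percentile_interval(replicates))
```

The normal-theory interval is centered on the original estimate and is symmetric by construction, while the percentile interval follows the shape of the bootstrap distribution.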

5.4 Alternative methods

5.4.1 Parametric Bootstrap

The parametric bootstrap draws samples from the estimated model.

For example, in linear regression, we can start from the model, \[ y_i = \Vec{x}_i \Vec{\beta} + \epsilon_i \]

  1. Estimate the regression model to get \(\hat{\beta}\) and \(\hat{\sigma}\)

  2. For each bootstrap replicate \(b = 1, \dots, r\):

    1. Generate a bootstrap sample \((\Vec{y}^{*}, \Mat{X})\), where \(\Mat{X}\) is taken from the original sample, and the values of \(\Vec{y}^{*}\) are generated by sampling from the estimated error distribution, \[ y^{*}_{i,b} = \Vec{x}_i \Vec{\hat{\beta}} + \epsilon^{*}_{i,b} \] where \(\epsilon^{*}_{i,b} \sim \mathrm{Normal}(0, \hat{\sigma})\).

    2. Re-estimate a regression on \((\Vec{y}^{*}, \Mat{X})\) to estimate \(\hat{\beta}^{*}\).

    3. Calculate any statistics of the regression results.

Alternatively, we could have drawn the values of \(\Vec{\epsilon}^*_b\) from the empirical distribution of the residuals, or used the wild bootstrap.
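The numbered steps of the parametric bootstrap can be sketched for a regression with an intercept and a single predictor. This is a simplified illustration under stated assumptions: the helper names `fit_ols` and `parametric_bootstrap_slopes` are hypothetical, the data are invented, \(\hat{\sigma}\) is computed with an \(n - 2\) denominator, and only the slope is stored as the statistic of interest.

```python
import math
import random


def fit_ols(x, y):
    """Least-squares intercept, slope, and residual standard deviation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(rss / (n - 2))  # two estimated coefficients
    return intercept, slope, sigma


def parametric_bootstrap_slopes(x, y, r=1000, seed=None):
    """Slopes re-estimated on r samples simulated from the fitted model."""
    rng = random.Random(seed)
    intercept, slope, sigma = fit_ols(x, y)  # step 1
    slopes = []
    for _ in range(r):  # step 2
        # 2.1: keep x fixed; simulate y* from the fitted line plus normal noise
        y_star = [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]
        # 2.2: re-estimate the regression on (y*, x)
        _, slope_star, _ = fit_ols(x, y_star)
        # 2.3: store the statistic of interest
        slopes.append(slope_star)
    return slopes


# Hypothetical data, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.3, 4.1, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
slopes = parametric_bootstrap_slopes(x, y, r=2000, seed=7)
print(sum(slopes) / len(slopes))
```

Replacing the `rng.gauss` draw with a draw from the list of fitted residuals would give the residual-resampling variant instead.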

See the discussion of sim = "parametric" in the documentation of the boot::boot() function.

5.4.2 Clustered bootstrap

We can incorporate complex sampling methods into the bootstrap (Fox 2008, Sec 21.5). In particular, by resampling clusters instead of individual observations, we get the clustered bootstrap (Esarey and Menger 2017).
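A minimal sketch of resampling whole clusters, assuming each observation carries a cluster label. The function name `cluster_bootstrap` and the data are hypothetical; this illustrates only the resampling step, not the specific procedures of Esarey and Menger (2017).

```python
import random
import statistics
from collections import defaultdict


def cluster_bootstrap(values, cluster_ids, statistic, B=1000, seed=None):
    """Replicates of `statistic` from resampling clusters with replacement."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for value, cid in zip(values, cluster_ids):
        clusters[cid].append(value)
    ids = list(clusters)
    replicates = []
    for _ in range(B):
        # Draw as many clusters as in the original data, with replacement,
        # and keep every observation of each drawn cluster together.
        drawn = rng.choices(ids, k=len(ids))
        resample = [v for cid in drawn for v in clusters[cid]]
        replicates.append(statistic(resample))
    return replicates


# Hypothetical clustered data: three clusters of three observations.
values = [1.0, 1.2, 0.9, 3.1, 2.9, 3.3, 5.0, 5.2, 4.8]
cluster_ids = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
reps = cluster_bootstrap(values, cluster_ids, statistics.mean, B=2000, seed=11)
print(statistics.stdev(reps))
```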

5.4.3 Time series bootstrap

Since the observations in a time series are not independent, variations of the bootstrap have to be used. See the references in the documentation for boot::tsboot.
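One such variation is to resample contiguous blocks of observations, so that short-range dependence is preserved within each block. The sketch below is a simple moving block scheme chosen here for illustration; the function name, the fixed block length, and the series are assumptions, and it is not a reimplementation of boot::tsboot.

```python
import random


def moving_block_resample(series, block_length, rng):
    """One resample built by concatenating randomly chosen contiguous blocks."""
    n = len(series)
    # Every admissible starting index of a full-length block.
    starts = range(n - block_length + 1)
    resample = []
    while len(resample) < n:
        start = rng.choice(starts)
        resample.extend(series[start:start + block_length])
    # Trim the last block so the resample has the original length.
    return resample[:n]


# Hypothetical series, used only for illustration.
rng = random.Random(3)
series = [0.2, 0.5, 0.9, 1.1, 0.8, 0.4, 0.1, -0.3, -0.6, -0.2, 0.3, 0.7]
print(moving_block_resample(series, block_length=4, rng=rng))
```

The block length trades off the two goals: longer blocks preserve more of the dependence, but leave fewer distinct blocks to resample from.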

5.4.4 How to sample?

Draw the bootstrap sample in the same way the original sample was drawn from the population, if possible (Hesterberg 2015, 19).

There are a few exceptions:

  • Condition on the observed information. We should fix known quantities, e.g. the observed sample sizes of sub-samples (Hesterberg 2015).
  • For hypothesis testing, the sampling distribution needs to be modified to represent the null distribution (Hesterberg 2015).

5.4.5 Caveats

  • Bootstrapping does not work well for the median or other quantities that depend on a small number of observations out of a larger sample (Hesterberg 2015).
  • Uncertainty in the bootstrap estimator is due to both (1) Monte Carlo sampling (taking a finite number of bootstrap samples), and (2) the sample itself. The former can be decreased by increasing the number of bootstrap samples. The latter is irreducible without a new sample.
  • The bootstrap distribution will reflect the data. If the sample was “unusual”, then the bootstrap distribution will also be so (Hesterberg 2015).
  • In small samples there is a narrowness bias (Hesterberg 2015, 24). As always, small samples are problematic.

5.4.6 Why use bootstrapping?

  • The common practice of relying on asymptotic results may understate variability by ignoring dependencies or heteroskedasticity. These can be incorporated into bootstrapping (Fox 2008, 602).
  • It is a general-purpose algorithm that can generate standard errors and confidence intervals in cases where an analytic solution does not exist.
  • However, it may require programming to implement and computational power to execute.

5.5 Bagging

Note that in all the previous discussion, the original point estimate is used. Bootstrapping is only used to generate (1) standard errors and (2) confidence intervals.

Bootstrap aggregating, or bagging, is a meta-algorithm that constructs a point estimate by averaging the point estimates from bootstrap samples. Bagging can reduce the variance of some estimators, so it can be thought of as a sort of regularization method.
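A minimal sketch of the averaging step, using the sample median as the estimator being bagged. The function name `bagged_estimate` and the data are hypothetical, and no claim is made here about how much variance reduction to expect for this estimator.

```python
import random
import statistics


def bagged_estimate(data, estimator, B=1000, seed=None):
    """Average of `estimator` over B bootstrap resamples of `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [estimator(rng.choices(data, k=n)) for _ in range(B)]
    return statistics.mean(estimates)


# Hypothetical sample, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
print(statistics.median(sample))  # original point estimate
print(bagged_estimate(sample, statistics.median, B=2000, seed=5))
```

The difference from the earlier sections is what gets reported: here the average of the replicates is itself the point estimate, rather than a by-product of estimating its uncertainty.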

5.6 Hypothesis Testing

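Hypothesis testing with the bootstrap is more complicated than estimating standard errors, because the resampling has to be modified to represent the null distribution (see “How to sample?” above).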
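As one illustration of resampling under a null hypothesis, the sketch below tests \(H_0: \mu = \mu_0\) by shifting the sample so that its mean equals \(\mu_0\) before resampling. This is a common textbook construction chosen here as an example; the function name, the two-sided comparison, and the add-one adjustment to the p-value are assumptions and are not taken from the references.

```python
import random
import statistics


def bootstrap_mean_test(data, mu0, B=2000, seed=None):
    """Two-sided p-value for H0: mean = mu0, resampling under the null."""
    rng = random.Random(seed)
    n = len(data)
    observed = statistics.mean(data)
    # Shift the data so that its mean equals mu0; the shifted sample
    # stands in for a population in which the null hypothesis is true.
    shifted = [x - observed + mu0 for x in data]
    extreme = 0
    for _ in range(B):
        mean_star = statistics.mean(rng.choices(shifted, k=n))
        if abs(mean_star - mu0) >= abs(observed - mu0):
            extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (extreme + 1) / (B + 1)


# Hypothetical sample, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
print(bootstrap_mean_test(sample, mu0=3.0, seed=9))
```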

5.7 How many samples?

There is no fixed rule of thumb (it will depend on the statistic you are calculating and the population distribution), but if you want a single number, 1,000 is a good lower bound.

  • Higher levels of confidence require more samples

  • Note that the results of the percentile method will be more variable than those of the normal-approximation method. The ABC confidence intervals will be even better.

One suggested ad-hoc recipe is:

  1. Choose a value of \(B\).
  2. Run the bootstrap.
  3. Run the bootstrap again (ensure that it uses a different random number seed).
  4. If the results differ, increase \(B\).
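The recipe above can be sketched as a loop. The meaning of "differ" is left open in the recipe, so the relative tolerance `rel_tol`, the doubling rule, the cap `max_B`, and the choice of the bootstrap standard error of the mean as the result being compared are all assumptions made for this illustration.

```python
import random
import statistics


def bootstrap_se_of_mean(data, B, seed):
    """Bootstrap standard error of the mean from B resamples."""
    rng = random.Random(seed)
    n = len(data)
    reps = [statistics.mean(rng.choices(data, k=n)) for _ in range(B)]
    return statistics.stdev(reps)


def choose_B(data, B=200, rel_tol=0.05, max_B=100_000):
    """Double B until two runs with different seeds agree within rel_tol."""
    while B < max_B:
        first = bootstrap_se_of_mean(data, B, seed=1)
        second = bootstrap_se_of_mean(data, B, seed=2)  # different seed
        if abs(first - second) <= rel_tol * max(first, second):
            return B
        B *= 2  # results differ: increase the size and try again
    return B


# Hypothetical sample, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9, 3.1, 4.0]
print(choose_B(sample))
```

A single pair of runs can agree by chance, so the value returned is better read as a lower bound than as a guarantee.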

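Davidson and MacKinnon (2000) suggest the following minimum numbers of bootstrap samples, depending on the level of the test:

  • 5% level: 399
  • 1% level: 1,499

They also suggest a pre-test method for choosing the number of bootstrap samples.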

Hesterberg (2015) suggests a far larger number of bootstrap samples: 10,000 for routine use. He notes that for a t-test, 15,000 samples are needed for a 95% probability that the one-sided levels fall within 10% of the true values, for 95% intervals and 5% tests.

5.8 References

See Fox (2008, Ch. 21).

Hesterberg (2015) is for “teachers of statistics” but is a great overview of bootstrapping. I found it more useful than the treatment of bootstrapping in many textbooks.

For some Monte Carlo results on the accuracy of the bootstrap see Hesterberg (2015), p. 21.

R packages. For general purpose bootstrapping and cross-validation I suggest the rsample package, which works well with the tidyverse and seems to be useful going forward.

The boot package included in the recommended R packages is a classic package that implements many bootstrapping and resampling methods. Most of them are parallelized. However, its interface is not as nice as rsample.

See this spreadsheet for some Monte Carlo simulations on Bootstrap vs. t-statistic.


Fox, John. 2008. Applied Regression Analysis & Generalized Linear Models. 2nd ed. Sage.

Hesterberg, Tim C. 2015. “What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum.” The American Statistician 69 (4). Taylor & Francis: 371–86. https://doi.org/10.1080/00031305.2015.1089789.

Esarey, Justin, and Andrew Menger. 2017. “Practical and Effective Approaches to Dealing with Clustered Data.” Working Paper. http://jee3.web.rice.edu/cluster-paper.pdf.

Davidson, Russell, and James G. MacKinnon. 2000. “Bootstrap Tests: How Many Bootstraps?” Econometric Reviews 19 (1). Taylor & Francis: 55–68. https://doi.org/10.1080/07474930008800459.