\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\median}{median} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\logistic}{Logistic} \DeclareMathOperator{\logit}{Logit} \newcommand{\mat}[1]{\boldsymbol{#1}} \newcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} % This follows BDA \newcommand{\dunif}{\mathsf{Uniform}} \newcommand{\dnorm}{\mathsf{Normal}} \newcommand{\dhalfnorm}{\mathrm{HalfNormal}} \newcommand{\dlnorm}{\mathsf{LogNormal}} \newcommand{\dmvnorm}{\mathsf{Normal}} \newcommand{\dgamma}{\mathsf{Gamma}} \newcommand{\dinvgamma}{\mathsf{InvGamma}} \newcommand{\dchisq}{\mathsf{ChiSquared}} \newcommand{\dinvchisq}{\mathsf{InvChiSquared}} \newcommand{\dexp}{\mathsf{Exponential}} \newcommand{\dlaplace}{\mathsf{Laplace}} \newcommand{\dweibull}{\mathsf{Weibull}} \newcommand{\dwishart}{\mathsf{Wishart}} \newcommand{\dinvwishart}{\mathsf{InvWishart}} \newcommand{\dlkj}{\mathsf{LkjCorr}} \newcommand{\dt}{\mathsf{StudentT}} \newcommand{\dhalft}{\mathsf{HalfStudentT}} \newcommand{\dbeta}{\mathsf{Beta}} \newcommand{\ddirichlet}{\mathsf{Dirichlet}} \newcommand{\dlogistic}{\mathsf{Logistic}} \newcommand{\dllogistic}{\mathsf{LogLogistic}} \newcommand{\dpois}{\mathsf{Poisson}} \newcommand{\dBinom}{\mathsf{Binomial}} \newcommand{\dmultinom}{\mathsf{Multinom}} \newcommand{\dnbinom}{\mathsf{NegativeBinomial}} \newcommand{\dnbinomalt}{\mathsf{NegativeBinomial2}} \newcommand{\dbetabinom}{\mathsf{BetaBinomial}} \newcommand{\dcauchy}{\mathsf{Cauchy}} \newcommand{\dhalfcauchy}{\mathsf{HalfCauchy}} \newcommand{\dbernoulli}{\mathsf{Bernoulli}} \newcommand{\R}{\mathbb{R}} \newcommand{\Reals}{\R} \newcommand{\RealPos}{\R^{+}} \newcommand{\N}{\mathbb{N}} \newcommand{\Nats}{\N} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator{\invlogit}{Inv-Logit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\diag}{diag} \]

16 Heteroskedasticity

Prerequisites

library("rstan")
library("tidyverse")

16.1 Introduction

Consider the linear regression model with normal errors, \[ y_i \sim \dnorm\left(\ X \beta, \sigma_i^2 \right) . \] Note that since the term \(\sigma_i\) is indexed by the observation, it can vary by observation. If \(\sigma_i = 1\) for all \(i\), then this corresponds to the classical homoskedastic linear regression. If \(\sigma_i\) differs for each \(i\), then it is a heteroskedastic regression.

In frequentist estimation linear regressions with heteroskedastic are often estimated using OLS with heteroskedasticity-consistent (HC) standard errors.12 However, HC standard errors are not a generative model, and in the Bayesian setting it is preferable to write a generative model that specifies a model for \(\sigma^2\).

16.2 Weighted Regression

A common case is “weighted” regression, where each \(y_i\) represents the mean of \(n_i\) observations. Then the scale of each observation is \[ \sigma_i = \omega / n_i , \] where \(\omega\) is a global scale.

Alternatively, suppose each observation represents the sum of each \(n_i\) observations. Then the scale of each observation is, \[ \sigma_i = n_i \omega . \]

16.3 Modeling the Scale with Covariates

The scale can also be modeled with covariates.

It is common to model the log-transformation of the scale or variance to transform it to \(\R\), \[ \log \sigma_i = \dnorm(Z_i \gamma, \omega) \] where \(Z_i\) are covariates used to the model the variance, which may or may not be the same as \(X_i\).

Another common model is the variance as a function of the mean, \[ \begin{align} \log \sigma_i = f(\mu_i). \end{align} \]

Consider the well-known normal approximation of the binomial distribution, \[ \dnorm(n_i | N_i, \pi_i) \approx \dnorm(n_i | \pi N_i, \pi (1 - \pi) N_i) . \] At the cost of treating the outcome as continuous rather than discrete, this approximation can provide a flexible model for over- or under-dispersion, by adding a dispersion term \(\delta \in R^{+}\), \[ n_i \sim \dnorm(\pi N_i, \delta \pi (1 - \pi)) . \] A similar approximation can be applied to unbounded count models using the normal approximation to the Poisson, \(\dpois(y_i | \lambda_i) \approx \dnorm(y_i | \lambda_i \lambda_i)\).

16.4 Prior Distributions

A reparameterization that will be used quite often is to rewrite a normal distributions with unequal scale parameters as the product of a common global scale parameter (\(\omega\)), and observation specific local scale parameters, \(\lambda_i\),13 \[ y_i \sim \dnorm(X\beta, \lambda_i \omega) . \] If the local variance parameters are distributed inverse-gamma, \[ \lambda^2 \sim \dinvgamma(\nu / 2, \nu / 2) \] then the above is equivalent to a regression with errors distributed Student-t errors with \(\nu\) degrees of freedom, \[ y_i \sim \dt\left(\nu, X \beta, \sigma \right) . \]

Note that if a random variable \(X\) is distributed inverse-gamma, it is equivalent to \(1 / X\) being distributed gamma. In this example, \[ \dinvgamma(\lambda^2 | \nu / 2, \nu / 2) = \dgamma\left(\frac{1}{\lambda^2} \middle| \nu / 2, \nu / 2\right) = \]

Example: Simulate Student-\(t\) distribution with \(\nu\) degrees of freedom as a scale mixture of normal. For *s in 1:S$,

  1. Simulate \(z_s \sim \dgamma(\nu / 2, \nu / 2)\)
  2. \(x_s = 1 / \sqrt{z_s}\) is draw from \(\dt(\nu, 0, 1)\).

When using R, ensure that you are using the correct parameterization of the gamma distribution. Left to reader

We can also model heteroskedasticity by placing a prior distribution on the variances.

\[ \sigma_i = \omega \lambda_i . \] Thus, the heteroskedastic likelihood is, \[ y_i \sim \dnorm\left( \mu_i, \omega^2 \lambda_i^2\right) \] Note that since the number of \(\lambda_i\) parameters are equal to the number of observations, this model will not have a proper posterior distribution without a proper prior distribution. However, if a proper prior is placed on \(\lambda_i\), then the posterior distribution for this model exists.

Suppose \(1 / \lambda_i^2\) is distributed with a specific gamma distribution, \[ 1 / \lambda_i^2 \sim \dgamma(d / 2, d / 2) . \] This \[ \dt(y_i | \nu, \mu_i, \omega) = \int \dnorm(y | \mu_i, \omega^2 \lambda^2) \dinvgamma(\nu / 2, \nu / 2) d \lambda^2 \]

This is equivalent to a regression model with Student-t errors. \[ y_i \sim \dt\left(d, ., \omega \right) . \] Thus, “robust” regression models with Student-\(t\) errors can be derived from a particular model of heteroskedastic normal errors.

The Stan model that estimates this is lm_student_t_1.stan:

// lm_student_t_1.stan
// Linear Model with Student-t Errors
data {
  // number of observations
  int N;
  // response
  vector[N] y;
  // number of columns in the design matrix X
  int K;
  // design matrix X
  // should not include an intercept
  matrix [N, K] X;
  // priors on alpha
  real scale_alpha;
  vector[K] scale_beta;
  real loc_sigma;
  // keep responses
  int use_y_rep;
  int use_log_lik;
}
parameters {
  // regression coefficient vector
  real alpha;
  vector[K] beta;
  real sigma;
  // degrees of freedom;
  // limit df = 2 so that there is a finite variance
  real nu;
}
transformed parameters {
  vector[N] mu;

  mu = alpha + X * beta;
}
model {
  // priors
  alpha ~ normal(0.0, scale_alpha);
  beta ~ normal(0.0, scale_beta);
  sigma ~ exponential(loc_sigma);
  // see Stan prior distribution suggestions
  nu ~ gamma(2, 0.1);
  // likelihood
  y ~ student_t(nu, mu, sigma);
}
generated quantities {
  // simulate data from the posterior
  vector[N * use_y_rep] y_rep;
  // log-likelihood posterior
  vector[N * use_log_lik] log_lik;
  for (i in 1:num_elements(y_rep)) {
    y_rep[i] = student_t_rng(nu, mu[i], sigma);
  }
  for (i in 1:num_elements(log_lik)) {
    log_lik[i] = student_t_lpdf(y[i] | nu, mu[i], sigma);
  }
}

We could also apply other prior distributions to the Another flexible model of heteroskedasticity is to apply a Dirichlet model to the local scales, \[ \begin{aligned}[t] \lambda_i &\sim \ddirichlet(a, w), & \lambda_i \geq 0, \sum_i \lambda_i = 1 . \end{aligned} \]

16.4.1 Examples: Duncan

Estimate the linear regression with the Duncan data using heteroskedastic errors.

data("Duncan", package = "carData")
mod_norm <- stan_model("stan/lm_normal_1.stan", verbose = FALSE)
mod_t <- stan_model("stan/lm_student_t_1.stan", verbose = FALSE)

16.5 Exercises

  • Estimate examples in the hett package with Stan.

16.6 References

For more on heteroskedasticity see A. Gelman, Carlin, et al. (2013 Sec. 14.7) for models with unequal variances and correlations.

Stan Development Team (2016) discusses reparameterizing the Student t distribution as a mixture of gamma distributions in Stan.


  1. See https://arxiv.org/pdf/1101.1402.pdf and http://econ.ucsb.edu/~startz/Bayesian%20Heteroskedasticity-Robust%20Regression.pdf.

  2. See this for a visualization of a Student-t distribution a mixture of Normal distributions, and this for a derivation of the Student t distribution as a mixture of normal distributions. This scale mixture of normal representation will also be used with shrinkage priors on the regression coefficients.