\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\median}{median} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\logistic}{Logistic} \DeclareMathOperator{\logit}{Logit} \newcommand{\mat}[1]{\boldsymbol{#1}} \newcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} % This follows BDA \newcommand{\dunif}{\mathsf{Uniform}} \newcommand{\dnorm}{\mathsf{Normal}} \newcommand{\dhalfnorm}{\mathrm{HalfNormal}} \newcommand{\dlnorm}{\mathsf{LogNormal}} \newcommand{\dmvnorm}{\mathsf{Normal}} \newcommand{\dgamma}{\mathsf{Gamma}} \newcommand{\dinvgamma}{\mathsf{InvGamma}} \newcommand{\dchisq}{\mathsf{ChiSquared}} \newcommand{\dinvchisq}{\mathsf{InvChiSquared}} \newcommand{\dexp}{\mathsf{Exponential}} \newcommand{\dlaplace}{\mathsf{Laplace}} \newcommand{\dweibull}{\mathsf{Weibull}} \newcommand{\dwishart}{\mathsf{Wishart}} \newcommand{\dinvwishart}{\mathsf{InvWishart}} \newcommand{\dlkj}{\mathsf{LkjCorr}} \newcommand{\dt}{\mathsf{StudentT}} \newcommand{\dhalft}{\mathsf{HalfStudentT}} \newcommand{\dbeta}{\mathsf{Beta}} \newcommand{\ddirichlet}{\mathsf{Dirichlet}} \newcommand{\dlogistic}{\mathsf{Logistic}} \newcommand{\dllogistic}{\mathsf{LogLogistic}} \newcommand{\dpois}{\mathsf{Poisson}} \newcommand{\dBinom}{\mathsf{Binomial}} \newcommand{\dmultinom}{\mathsf{Multinom}} \newcommand{\dnbinom}{\mathsf{NegativeBinomial}} \newcommand{\dnbinomalt}{\mathsf{NegativeBinomial2}} \newcommand{\dbetabinom}{\mathsf{BetaBinomial}} \newcommand{\dcauchy}{\mathsf{Cauchy}} \newcommand{\dhalfcauchy}{\mathsf{HalfCauchy}} \newcommand{\dbernoulli}{\mathsf{Bernoulli}} \newcommand{\R}{\mathbb{R}} \newcommand{\Reals}{\R} \newcommand{\RealPos}{\R^{+}} \newcommand{\N}{\mathbb{N}} \newcommand{\Nats}{\N} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator{\invlogit}{Inv-Logit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\diag}{diag} \]

11 Introduction to Stan and Linear Regression

This chapter is an introduction to writing and running a Stan model in R. Also see the rstan vignette for similar content.

Prerequisites

library("rstan")
library("tidyverse")
library("recipes")

For this section we will use the Duncan dataset included in the carData package. Duncan’s occupational prestige data is an example dataset used throughout the popular Fox regression text, Applied Regression Analysis and Generalized Linear Models (Fox 2016). It is originally from Duncan (1961) and consists of survey data on the prestige of occupations in the US in 1950, along with several predictors: the type of occupation, and its income and education.

data("Duncan", package = "carData")

11.1 OLS and MLE Linear Regression

The first step in running a Stan model is defining the Bayesian statistical model that will be used for inference.

We will model the prestige of each occupation as a function of its income, education, and type.

A standard way to do this is with the OLS estimator: \[ \begin{multline} y_i = \beta_0 + \beta_1 I(\mathtt{type} = \mathtt{"prof"}) + \beta_2 I(\mathtt{type} = \mathtt{"wc"}) \\ \quad + \beta_3 \mathtt{income} + \beta_4 \mathtt{education} + \epsilon_i \end{multline} \]

duncan_lm <- lm(prestige ~ type + income + education, data = Duncan)

\[ y_i = x_i' \beta + \epsilon_i \] OLS finds \(\hat{\beta}_{OLS}\) by minimizing the squared errors, \[ \hat{\beta}_{\text{OLS}} = \arg \min_{b} \sum_{i = 1}^n (y_i - x_i' b)^2 . \] OLS is an estimator of the (linear approximation of the) conditional expectation function, \[ \mathrm{CEF}(y_i | x_i) = \E(y_i | x_i) = x_i' \beta . \]
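As a quick check on this formula, here is a minimal sketch (the object names X_ols and beta_hat are ours) that computes the closed-form solution \(\hat{\beta}_{OLS} = (X'X)^{-1} X' y\) directly and compares it to lm():

# Closed-form OLS: solve (X'X) b = X'y rather than inverting X'X directly
X_ols <- model.matrix(prestige ~ type + income + education, data = Duncan)
beta_hat <- solve(crossprod(X_ols), crossprod(X_ols, Duncan$prestige))
# Should match the lm() coefficients
cbind(beta_hat, coef(duncan_lm))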

For valid inference we need to make assumptions about \(\epsilon_i\), namely that the errors are uncorrelated with \(X\), \(\Cov(\epsilon, X) = 0\), and that they are i.i.d., \(\Cov(\epsilon_i, \epsilon_j) = 0\) for \(i \neq j\) and \(\Var(\epsilon_i) = \sigma^2\) for all \(i\). However, no specific distributional form is (or needs to be) assumed for \(\epsilon\), since CLT results show that, asymptotically, the sampling distribution of \(\hat\beta\) approaches a normal distribution. Additionally, although \(\hat\sigma^2 = \sum_{i = 1}^n \hat\epsilon_i^2 / (n - k - 1)\) is an estimator of \(\sigma^2\), standard errors for \(\hat\sigma\), the standard error of the regression, are not directly provided.

However, the OLS estimator is also the same as the MLE estimator for \(\beta\) (but not for \(\sigma\)): \[ \begin{aligned}[t] p(y_1, \dots, y_n | \beta, \sigma, x_1, \dots, x_n) &= \prod_{i = 1}^n p(y_i | \beta, \sigma, x_i) \\ &= \prod_{i = 1}^n N(y_i | x_i' \beta, \sigma^2) \\ &= \prod_{i = 1}^n \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( \frac{-(y_i - x_i' \beta)^2}{2 \sigma^2} \right) \end{aligned} \] so, \[ \hat{\beta}_{MLE}, \hat{\sigma}_{MLE} = \arg\max_{b,s} \prod_{i = 1}^n N(y_i | x_i' b, s^2) , \] and \(\hat{\beta}_{MLE} = \hat{\beta}_{OLS}\).

Note that the OLS estimator is equivalent to the MLE estimator of \(\beta\), \[ \begin{aligned}[t] \hat{\beta}_{MLE} &= \arg \max_{b} \prod_{i = 1}^n N(y_i | x_i' b, \sigma^2) \\ &= \arg \max_{b} \prod_{i = 1}^n \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( \frac{-(y_i - x_i' b)^2}{2 \sigma^2} \right) \\ &= \arg \max_{b} \log \left( \prod_{i = 1}^n \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( \frac{-(y_i - x_i' b)^2}{2 \sigma^2} \right) \right) \\ &= \arg \max_{b} \sum_{i = 1}^n \left( - \log \sigma - \frac{1}{2} \log 2 \pi - \frac{(y_i - x_i' b)^2}{2 \sigma^2} \right) \\ &= \arg \max_{b} \sum_{i = 1}^n -(y_i - x_i' b)^2 \\ &= \arg \min_{b} \sum_{i = 1}^n (y_i - x_i' b)^2 \\ &= \hat{\beta}_{OLS} . \end{aligned} \] However, \(\hat\sigma^2_{MLE} \neq \hat\sigma^2_{OLS}\): the MLE divides the residual sum of squares by \(n\) rather than \(n - k - 1\).
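This equivalence can also be verified numerically. The sketch below (negloglik and the other names are ours) maximizes the normal log-likelihood with optim() and compares \(\hat\beta_{MLE}\) to the lm() estimates:

# Negative log-likelihood of the linear-normal model; the last element of
# theta is log(sigma) so the optimization is unconstrained
negloglik <- function(theta, X, y) {
  beta <- theta[-length(theta)]
  sigma <- exp(theta[length(theta)])
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}
X_ols <- model.matrix(prestige ~ type + income + education, data = Duncan)
start <- c(rep(0, ncol(X_ols)), log(sd(Duncan$prestige)))
fit_mle <- optim(start, negloglik, X = X_ols, y = Duncan$prestige,
                 method = "BFGS")
# The beta estimates agree with OLS
cbind(mle = fit_mle$par[seq_len(ncol(X_ols))], ols = coef(duncan_lm))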

11.1.1 Bayesian Model with Improper Priors

In Bayesian inference, our target is the posterior distribution of the parameters \(\beta\) and \(\sigma\): \(p(\beta, \sigma | y, X)\).

\[ p(\beta, \sigma | y, X) \propto p(y | \beta, \sigma) p(\beta, \sigma) \]

For a Bayesian linear regression model, we’ll need to specify distributions for \(p(y | \beta, \sigma)\) and \(p(\beta, \sigma)\).

Likelihood: for \(p(y_i | x_i, \beta, \sigma)\), suppose that the observations are distributed independent normal: \[ y_i \sim \dnorm(x_i' \beta, \sigma^2) \]

Priors: The model needs to specify a prior distribution for the parameters \((\beta, \sigma)\). Rather than specify a single joint distribution for \(\beta\) and \(\sigma\), it will be easier to specify independent (separate) prior distributions for each.

We will use what is called an improper uniform prior. An improper prior is \[ p(\theta) \propto C , \] where \(C\) is some constant. This puts equal density on all values in the support of \(\theta\). It is not a proper probability density function, since \(\int_{\theta \in \Theta} C \,d\theta = \infty\). However, for some Bayesian models the prior does not need to be a proper probability density for the posterior to be one. In this example we will put an improper prior distribution on both \(\beta\) and \(\sigma\): \[ p(\beta, \sigma) \propto C . \]

\[ \begin{aligned} p(\beta, \sigma | x, y) &\propto p(y | \beta, \sigma, x) \, p(\beta, \sigma) \\ &= \prod_{i = 1}^n N(y_i | x_i' \beta, \sigma^2) \cdot C \\ &\propto \prod_{i = 1}^n N(y_i | x_i' \beta, \sigma^2) \end{aligned} \]

Note that under the improper priors, the posterior is proportional to the likelihood, \[ p(\beta, \sigma | x, y) \propto p(y | x, \beta, \sigma) . \] Thus the MAP (maximum a posteriori) estimator is the same as the MLE, \[ \hat{\beta}_{MAP}, \hat{\sigma}_{MAP} = \arg\max_{\beta, \sigma} p(\beta, \sigma | x, y) = \arg \max_{\beta, \sigma} p(y | x, \beta, \sigma) = \hat{\beta}_{MLE}, \hat{\sigma}_{MLE} \]
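In fact, more than the posterior mode matches: for any fixed \(\sigma\), a flat prior on \(\beta\) yields a conditional posterior that is exactly normal and centered at the OLS estimate (a standard result, obtained by completing the square in \(\beta\)): \[ \beta | \sigma, y, X \sim \dnorm\left( \hat{\beta}_{OLS}, \ \sigma^2 (X'X)^{-1} \right) . \]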

11.2 Stan Model

Let’s write and estimate our model in Stan. Stan models are written in its own domain-specific language that focuses on declaring the statistical model (parameters, variables, distributions) while leaving the details of the sampling algorithm to Stan.

A Stan model consists of blocks which contain declarations of variables and/or statements. Each block has a specific purpose in the model.


functions {
    // OPTIONAL: user-defined functions
}
data {
    // read in data ...
}
transformed data {
    // Create new variables/auxiliary variables from the data
}
parameters {
    // Declare parameters that will be estimated
}
transformed parameters {
    // Create new variables/auxiliary variables from the parameters
}
model {
    // Declare your probability model: priors, hyperpriors & likelihood
}
generated quantities {
    // Declare any quantities other than simulated parameters to be generated
}

11.3 Sampling Model with Stan

The file lm_normal_1.stan is a Stan model for the linear regression model previously defined.

// lm_normal_1.stan
// Linear Model with Normal Errors
data {
  // number of observations
  int<lower=0> N;
  // response
  vector[N] y;
  // number of columns in the design matrix X
  int<lower=0> K;
  // design matrix X
  // should not include a column of 1s for the intercept
  matrix[N, K] X;
  // prior scales for alpha and beta
  real<lower=0.> scale_alpha;
  vector<lower=0.>[K] scale_beta;
  // prior rate for sigma
  real<lower=0.> loc_sigma;
  // flags: generate replicated data and log-likelihood?
  int<lower=0, upper=1> use_y_rep;
  int<lower=0, upper=1> use_log_lik;
}
parameters {
  // intercept
  real alpha;
  // regression coefficient vector
  vector[K] beta;
  // scale of the regression errors; must be positive
  real<lower=0.> sigma;
}
transformed parameters {
  vector[N] mu;

  mu = alpha + X * beta;
}
model {
  // priors
  alpha ~ normal(0., scale_alpha);
  beta ~ normal(0., scale_beta);
  sigma ~ exponential(loc_sigma);
  // likelihood
  y ~ normal(mu, sigma);
}
generated quantities {
  // simulate data from the posterior
  vector[N * use_y_rep] y_rep;
  // log-likelihood posterior
  vector[N * use_log_lik] log_lik;
  for (i in 1:num_elements(y_rep)) {
    y_rep[i] = normal_rng(mu[i], sigma);
  }
  for (i in 1:num_elements(log_lik)) {
    log_lik[i] = normal_lpdf(y[i] | mu[i], sigma);
  }
}
mod1 <- stan_model("stan/lm_normal_1.stan")
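Two rstan options are worth setting before sampling: auto_write caches the compiled model so repeated calls to stan_model() do not recompile it, and mc.cores runs the Markov chains in parallel.

# Cache compiled models on disk and run chains in parallel
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())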

See the Stan Modeling Language User’s Guide and Reference Manual for details of the Stan Language.

Note: Since a Stan model compiles to C++ code, you may receive some warning messages such as

/Library/Frameworks/R.framework/Versions/3.3/Resources/library/StanHeaders/include/stan/math/rev/core/set_zero_all_adjoints.hpp:14:17: warning: unused function 'set_zero_all_adjoints' [-Wunused-function]
    static void set_zero_all_adjoints() {
                ^
In file included from file1d4a4d50faa.cpp:8:
In file included from /Library/Frameworks/R.framework/Versions/3.3/Resources/library/StanHeaders/include/src/stan/model/model_header.hpp:4:

As long as your model compiles, you can ignore these compiler warnings. (On the other hand, warnings that occur during sampling should not be ignored.) If the Stan model does not give you a syntax error when parsing the model, it should compile to valid C++.[^bugs][^c-warnings]

[^bugs]: In the rare case that the Stan parser transpiles the Stan model to C++ but cannot compile the C++ code, it is a bug in Stan. Follow the instructions on how to inform the Stan developers about bugs.

[^c-warnings]: The extended installation instructions for MacOS/Linux and Windows have instructions for adding compiler options to the R Makevars file.

11.3.1 Sampling

In order to sample from the model, we need to at least give it the values of the data to use: `N`, `K`, `y`, and `X`, plus the data associated with the priors.

The data types in Stan are all numeric (either integers or reals), but they include matrices and vectors. However, there is nothing like a data frame in Stan. Whereas the R function lm takes a formula and a data set and creates the appropriate \(X\) matrix for the regression, in Stan we need to create that matrix ourselves: expanding categorical variables to indicator variables, and expanding interactions and other functions of the predictors.

rec <- recipe(prestige ~ income + education + type, data = Duncan) %>%
  step_dummy(type) %>%
  prep(training = Duncan, retain = TRUE)
X <- juice(rec, all_predictors(), composition = "matrix")
y <- drop(juice(rec, all_outcomes(), composition = "matrix"))
mod1_data <- list(
  X = X,
  K = ncol(X),
  N = nrow(X),
  y = y,
  use_y_rep = FALSE,
  use_log_lik = FALSE
)
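If you prefer not to use recipes, a base-R sketch of the same construction (the name X_alt is ours) uses model.matrix(), dropping its intercept column because the Stan model supplies the intercept alpha separately:

# model.matrix() expands the factor `type` into indicator columns;
# column 1 is the intercept, which the Stan model handles as alpha
X_alt <- model.matrix(prestige ~ income + education + type, data = Duncan)[, -1]
dim(X_alt)  # same dimensions as X above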

We still need to provide values for the prior distributions. Here we make them weakly informative: the priors on the coefficients are centered at zero, with scales set to large multiples of the scale of the data.

mod1_data$scale_alpha <- sd(y) * 10
mod1_data$scale_beta <- apply(X, 2, sd) * sd(y) * 2.5
# exponential() in Stan is parameterized by a rate, so use 1 / sd(y)
# to give sigma a prior mean equal to sd(y)
mod1_data$loc_sigma <- 1 / sd(y)
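It is worth inspecting the implied scales before sampling:

# Check the data-dependent prior scales and rate
mod1_data[c("scale_alpha", "scale_beta", "loc_sigma")]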

Now, sample from the posterior, using the function sampling:

mod1_fit <- sampling(mod1, data = mod1_data)
summary(mod1_fit)
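Since the transformed parameter mu has one element per observation, summary() prints a long table; to focus on the regression parameters, restrict the output with pars:

# Summarize only the intercept, coefficients, and error scale
print(mod1_fit, pars = c("alpha", "beta", "sigma"))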

11.3.2 Convergence Diagnostics and Model Fit

  • Convergence Diagnostics: Is this the posterior distribution that you were looking for? These don’t directly say anything about how “good” the model is at representing the data; they only evaluate how well the sampler is sampling the posterior distribution of the given model. If there are problems with these, then the samples do not represent the posterior distribution, and your inferences will be biased. (A code sketch of these checks follows this list.)

    • mcse: the Monte Carlo standard error, the uncertainty in a posterior estimate due to using a finite number of draws
    • n_eff: the effective sample size, the number of independent draws that the autocorrelated MCMC draws are worth
    • Rhat: the potential scale reduction factor comparing chains; values near 1 indicate convergence
    • divergences: divergent transitions indicate that the sampler is failing to explore part of the posterior
  • Model fit: Is this statistical model appropriate for the data? Or better than other models?

    • Posterior predictive checks

    • Information criteria:

      • WAIC
      • Leave-one-out Cross-Validation
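Below is a minimal sketch of running these checks for this model, assuming the loo and bayesplot packages are installed. Because we disabled the generated quantities above, the model is refit with the flags turned on.

library("loo")

# Refit with posterior predictive replications and log-likelihood enabled
mod1_data$use_y_rep <- TRUE
mod1_data$use_log_lik <- TRUE
mod1_fit2 <- sampling(mod1, data = mod1_data)

# Convergence: check divergences, tree depth, and energy diagnostics
check_hmc_diagnostics(mod1_fit2)

# Posterior predictive check: compare replicated data to the observed y
y_rep <- as.matrix(mod1_fit2, pars = "y_rep")
bayesplot::ppc_dens_overlay(y, y_rep[1:50, ])

# LOO-CV and WAIC from the pointwise log-likelihood
log_lik1 <- extract_log_lik(mod1_fit2, parameter_name = "log_lik")
loo(log_lik1)
waic(log_lik1)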