\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\median}{median} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\logistic}{Logistic} \DeclareMathOperator{\logit}{Logit} \newcommand{\mat}[1]{\boldsymbol{#1}} \newcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} % This follows BDA \newcommand{\dunif}{\mathsf{Uniform}} \newcommand{\dnorm}{\mathsf{Normal}} \newcommand{\dhalfnorm}{\mathrm{HalfNormal}} \newcommand{\dlnorm}{\mathsf{LogNormal}} \newcommand{\dmvnorm}{\mathsf{Normal}} \newcommand{\dgamma}{\mathsf{Gamma}} \newcommand{\dinvgamma}{\mathsf{InvGamma}} \newcommand{\dchisq}{\mathsf{ChiSquared}} \newcommand{\dinvchisq}{\mathsf{InvChiSquared}} \newcommand{\dexp}{\mathsf{Exponential}} \newcommand{\dlaplace}{\mathsf{Laplace}} \newcommand{\dweibull}{\mathsf{Weibull}} \newcommand{\dwishart}{\mathsf{Wishart}} \newcommand{\dinvwishart}{\mathsf{InvWishart}} \newcommand{\dlkj}{\mathsf{LkjCorr}} \newcommand{\dt}{\mathsf{StudentT}} \newcommand{\dhalft}{\mathsf{HalfStudentT}} \newcommand{\dbeta}{\mathsf{Beta}} \newcommand{\ddirichlet}{\mathsf{Dirichlet}} \newcommand{\dlogistic}{\mathsf{Logistic}} \newcommand{\dllogistic}{\mathsf{LogLogistic}} \newcommand{\dpois}{\mathsf{Poisson}} \newcommand{\dBinom}{\mathsf{Binomial}} \newcommand{\dmultinom}{\mathsf{Multinom}} \newcommand{\dnbinom}{\mathsf{NegativeBinomial}} \newcommand{\dnbinomalt}{\mathsf{NegativeBinomial2}} \newcommand{\dbetabinom}{\mathsf{BetaBinomial}} \newcommand{\dcauchy}{\mathsf{Cauchy}} \newcommand{\dhalfcauchy}{\mathsf{HalfCauchy}} \newcommand{\dbernoulli}{\mathsf{Bernoulli}} \newcommand{\R}{\mathbb{R}} \newcommand{\Reals}{\R} \newcommand{\RealPos}{\R^{+}} \newcommand{\N}{\mathbb{N}} \newcommand{\Nats}{\N} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator{\invlogit}{Inv-Logit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\diag}{diag} \]

8 MCMC Diagnostics

There are two parts of checking a Bayesian model:

diagnostics: Is the sampler working? Is it adequately approximating the specified posterior distribution: \(p(\theta | D)\).
model fit: Does the model adequately represent the data?

This chapter covers the former.

Also see the bayesplot vignette Visual MCMC diagnostics using the bayesplot package, which though specific to the provides, provides a good overview of these diagnostics.

Prerequisites

library("rstan")
library("tidyverse")

8.1 Reparameterize Models

Reduce correlation between parameters (e.g. see mcmc_pairs)
Put parameters on the same scale. The samplers work best when all parameters are roughly on the same scale, e.g. \(\approx 1\). Try to avoid situations where parameters are orders of magnitude different, e.g. 1e-5 and 1e+10.
Increase the informativeness of priors. If parameters are too uninformative, the posterior distribution may have wide tails that hamper sampling. One way of thinking about it is that the model is only “weakly identified” and requires either more data or more informative priors to estimate.

8.2 Convergence Diagnostics

Under certain conditions, MCMC algorithms will draw a sample from the target posterior distribution after it has converged to equilibrium. However, since in practice, any sample is finite, there is no guarantee about whether its converged, or is close enough to the posterior distribution.

In general there is no way to prove that the sampler has converged.⁶ However, there are several statistics that indicate that a sampler has not converged.

8.2.1 Potential Scale Reduction (\(\hat{R}\))

In equilibrium, the distribution of samples from chains should be the same regardless of the initial starting values of the chains (Stan Development Team 2016, Sec 28.2).

One way to check this is to compare the distributions of multiple chains—in equilibrium they should all have the same mean. Additionally, the split \(\hat{R}\) tests for convergence by splitting the chain in half, and testing the hypothesis that the means are the same in each half. This tests for non-stationarity within a chain.

See Stan Development Team (2016 Sec 28.2) for the equations to calculate these.

TODO: Examples of passing and non-passing \(\hat{R}\) chains using fake data generated from known functions with a given autocorrelation.

Rule of Thumb: The rule of thumb is that R-hat values for all less than 1.1 source. Note that all parameters must show convergence. This is a necessary but not sufficient condition for convergence.

8.2.2 References

A. Gelman, Carlin, et al. (2013, 267)
Stan Development Team (2016 Ch 28.) for how Stan calculates Hat, autocorrelations, and ESS.
Gelman and Rubin (1992) introduce the R-hat statistic

8.3 Autocorrelation, Effective Sample Size, and MCSE

MCMC samples are dependent. This does not effect the validity of inference on the posterior if the samplers has time to explore the posterior distribution, but it does affect the efficiency of the sampler.

In other words, highly correlated MCMC samplers requires more samples to produce the same level of Monte Carlo error for an estimate.

8.3.1 Effective Sample Size

The effective sample size (ESS) measures the amount by which autocorrelation in samples increases uncertainty (standard errors) relative to an independent sample. Suppose that the \(\rho^2_t\) is the ACF function of a sample of size \(N\), the effective sample size, \(N_eff\), is \[ N_{eff} = \frac{N}{\sum_{t = -\infty}^\infty \rho_t} = \frac{N}{1 + 2 \sum_{t = -\infty}^\infty \rho_t}. \] TODO show that if \(\rho_t = 1\) for all \(t\) then \(N_eff = 1\), and if \(\rho_t = 0\) for all \(t\) then \(N_eff = N\)

Computing the effective sample size requires calculating an autocorrelation. A multi-chain estimate of the autocorrelation is found by computing the variogram with the correlations for all lags, \[ V_t = \frac{1}{m(n - t)} \sum_{j = 1}^m \sum_{i = t + 1}^n (\psi_{i,j} - \psi_{i - t,j})^2 \] The estimate of the autocorrelations \(\hat{\rho}_t\) is \[ \hat{\rho}_t = 1 - \frac{V_t}{2 \widehat{\mathrm{var}}^+} \] The estimates of the autocorrelations can be noisy, so \(\hat{\rho}_t\) are summed from 0, to the last \(t\) such that \(\rho\) is positive (\(T\)), \[ \hat{n}_{eff} = \frac{mn}{1 + 2 \sum_{t = 1}^T \hat{\rho}_t} \]

See also Stan Development Team (2016 Sec 28.4), Geyer (2011), and A. Gelman, Carlin, et al. (2013 Sec 11.5).
This isn’t the only way to calculate the effective sample size. The coda package function coda uses a different method. The differences are due to how the autocorrelation is calculated.

Example: Comparison of the effective sample sizes for data generated with various levels of autocorrelation. The package rstan does not directly expose the function it uses to calculate ESS, so this ess function does so (for a single chain).

ess <- function(x) {
  N <- length(x)
  V <- map_dbl(seq_len(N - 1),
          function(t) {
             mean(diff(x, lag = t) ^ 2, na.rm = TRUE)
          })
  rho <- head_while(1 - V / var(x), ~ . > 0)
  N / (1 + sum(rho))
}
n <- 1024
sims <- map_df(c(0, 0.5, 0.75, 0.99),
  function(ar) {
    tibble(ar = ar,
           y = if (ar == 0) {
             rnorm(n)
           } else {
             as.numeric(arima.sim(list(ar = ar), n))
           },
           x = seq_along(y),
           n_eff = ess(y),
           label = sprintf("AR = %.2f (n_eff = %.0f)", ar, n_eff))
  }
)

ggplot(sims, aes(x = x, y = y)) +
  geom_line() +
  facet_wrap(~ label, scales = "free_y") +
  labs(x = "", y = "")

8.4 Thinning

Since the autocorrelation tends to decrease as the lag increases, thinning samples will reduce the final autocorrelation in the sample while also reducing the total number of samples saved. Due to the autocorrelation, the reduction in the number of effective samples will often be less than number of samples removed in thinning.

Both of these will produce 1,000 samples from the posterior, but effective sample size of \(B\) will be greater than the effective sample size of \(A\), since after thinning g the autocorrelation in \(B\) will be lower.

A Generating 1,000 samples after convergence and save all of them
B Generating 10,000 samples after convergence and save every 10th sample

In this case, A produces 10,000 samples, and B produces 1,000. The effective sample size of A will be higher than B. However, due to autocorrelation, the proportional reduction in the effective sample size in B will be less than the thinning: \(N_{eff}(A) / N_{eff}(B) < 10\).

A Generating 10,000 samples after convergence and save all of them
B Generating 10,000 samples after convergence and save every 10th sample

Thinning trades off sample size for memory, and due to autocorrelation in samples, loss in effective sample size is less than the loss in sample size.

Thinning has become less of an issue as memory has become less of a computational constraint, and samplers have become more efficient.

The following example simulates random values from an autocorrelated series, and applies different levels of thinning. Thinning is always decreasing the effective sample size. However, the number of effective samples per sample (n_eff / n) increases until the thinning is large enough that the thinned samples are uncorrelated.

thin_ess <- function(thin, x) {
  if (thin > 1) {
    # keep only thinned rows
    x_thinned <- x[(seq_len(length(x)) %% thin) == 1]
  } else {
    x_thinned <- x
  }
  tibble(thin = thin,
         n = length(x_thinned),
         n_eff = ess(x_thinned),
         `n_eff / n` = n_eff / n)
}

map_df(c(1, 2, 4, 8, 16, 32), thin_ess,
       x = arima.sim(list(ar = .9), 4096))
#> # A tibble: 6 x 4
#>    thin     n n_eff `n_eff / n`
#>   <dbl> <int> <dbl>       <dbl>
#> 1     1  4096 1266.       0.309
#> 2     2  2048 1087.       0.531
#> 3     4  1024  791.       0.773
#> 4     8   512  512        1    
#> 5    16   256  256        1    
#> 6    32   128  128        1

A. Gelman, Carlin, et al. (2013, 282–83)
Stan Development Team (2016, 354–55)

8.4.1 Traceplots

Trace plots are a time series of sampler iterations, e.g. as produced by bayesplot.

These can, but should not, be used to assess convergence,

such visual inspection is ‘notoriously unreliable’ (A. Gelman, Carlin, et al. 2013, 285)
it cannot scale to many parameters

Trace plots may be useful for diagnosing convergence problems after \(\hat{R}\) or or \(n_eff\) indicates problems. Some possible issues to check in these plots are

multimodality (the traceplot jumps between different distributions)
wide posterior tails (the traceplot shows regions where the sampler will reach and have difficulty returning to the main distribution)

8.4.2 Monte Carlo Standard Error (MCSE)

The Monte Carlo standard error is the uncertainty about a statistic in the sample due to sampling error. With a independent sample of size \(N\), the MCSE for the sample mean is \[ MCSE(\bar{\theta}) = \frac{s}{\sqrt{N}} \] where \(s\) is the sample standard deviation.

However, MCMC are generally not independent, and the MCSE will be higher than that of an independent sample. One way to calculate the MCSE with autocorrelated samples is to use the effective sample size instead of the sample size, \[ MCSE(\bar{\theta}) = \frac{s}{\sqrt{N_{eff}}} \]

An MCSE estimator for the mean is \[ \mathrm{MCSE}(\hat{\theta}) = \frac{\sd(\theta)}{\sqrt{n_{neff}}} \] An MCSE estimator for any posterior probability, where \(\hat{p} = \Pr(f(\theta))\), follows from the standard error of a proportion, but using the effective sample size, \[ MCSE(\hat{p}) = \sqrt{\hat{p} (1 - \hat{p}) / n_{eff}} \]

See Flegal, Haran, and Jones (2008) and the mcmcse for methods to calculate MCMC standard errors for means and quantiles using sub-sampling methods.

Flegal, Haran, and Jones (2008) argues for using ESS as a stopping rule and convergence diagnostic for Bayesian inference.

The estimation of standard errors for quantiles, as would be used in is more complicated. See the package mcmcse for Monte Carlo standard errors of quantiles (though calculated in a different method than rstan).

A. Gelman, Carlin, et al. (2013 Sec. 10.5)
Flegal, Haran, and Jones (2008)
Talk by Geyer on MCSE

8.5 HMC-NUT Specific Diagnostics

Hamiltonian Monte Carlo (HMC), and the No-U-Turn Sampler (HMC-NUTS) in particular, produce several diagnostics that indicate that the sampler is breaking and, thus, not sampling from the posterior distribution. This is unusual, as most Bayesian sampling methods do not give indication of whether they are working well, and all that can be checked are the properties of the samples themselves with methods such \(\hat{R}\).

Three specific HMC-NUTS diagnostics are

divergent transitions
maximum tree-depth
Bayesian fraction of missing information

The general way to fix these issues is the manually adjust the HMC-NUTS sampler parameters.n

Stepsize: Length of the steps to take
Tree-depth: Number of steps to take

During the warmup period, Stan tunes these values, however these auto-tuned parameters may not always be optimal. The other alternative is to reparameterize the models.

8.5.1 Divergent transitions

The details of the HMC are technical and can be found TODO. The gist of the problem is that Stan is using a discrete approximation of a continuous function when integrating. If the step sizes are too large, the discrete approximation does not work. Helpfully, when the approximation is poor it does not fail without any indication but will produce “divergent transitions”.

If there are too many divergent transitions, then the sampler is not drawing samples from the entire posterior and inferences will be biased.

Reduce the step size. This can be done by increasing the the adapt_delta parameter. This is the target average proposal acceptance probability in the adaptation, which is used to determine the step size during warmup. A higher desired acceptance probability (closer to 1) reduces the the step size. A smaller step size means that it will require more steps to explore the posterior distribution.

See Stan Development Team (2016, 380)

8.5.2 Maximum Tree-depth

NUTS is an intelligent method to select the number of steps to take in each iteration. However, there is still a maximum number of steps that NUTS will try. If the sampler is often hitting the maximum number of steps, it means that the optimal number of steps to take in each iteration is higher than the maximum. While divergent transitions bias inference, a too-small maximum tree-depth only affects efficiency. The sampler is still exploring the posterior distribution, but the exploration will be slower and the autocorrelation higher (effective sample size lower) than if the maximum tree-depth were set higher.

Increase the maximum tree-depth.

8.5.3 Bayesian Fraction of Missing Information

This is rather technical. See Betancourt (2016).

8.6 Debugging Bayesian Computing

See A. Gelman, Carlin, et al. (2013, 270), …

Pick a reasonable value for the “true” parameter vector \(\theta^* \sim p(\theta)\). This should be a random draw from the prior, but if \(\theta\) is uninformative, then any reasonable value of \(\theta\) should work.
Draw all other parameters that are conditional on \(\theta^*\).
Simulate a large fake dataset \(y^{fake}\) from the data distribution \(p(y | \theta^*)\)
Calculate the posterior \(p(\theta | y^{rep})\).
Compare the posterior distribution \(p(\theta | y)\) to its true value \(\theta^*\). For example there should be a 50% probability that the 50% posterior interval of \(p(\theta | y)\) contains \(\theta^*\).

Notes

for large number of parameters: calculate the proportion of 50% credible intervals that contain the true value.
residual plot: for all scalar parameters \(\theta_j\), calculate the residual \(\E[\theta_j] - \theta^*\). The residuals should have mean 0.
for models with few parameters, perform many fake data simulations.
convergence and posterior predictive tests may also indicate issues with the model.
- it could be either a problem with the model, or a problem in computation.
- simplify the model by removing parameters or setting them to fixed values
Start with a simple model and build complexity. This is useful both because the simple model may be “good enough” for your purposes, and because adding one part at a time makes it easier to catch and fix bugs.

This is also the case in optimization with non-convex objective functions.↩