\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\median}{median} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\logistic}{Logistic} \DeclareMathOperator{\logit}{Logit} \newcommand{\mat}[1]{\boldsymbol{#1}} \newcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} % This follows BDA \newcommand{\dunif}{\mathsf{Uniform}} \newcommand{\dnorm}{\mathsf{Normal}} \newcommand{\dhalfnorm}{\mathsf{HalfNormal}} \newcommand{\dlnorm}{\mathsf{LogNormal}} \newcommand{\dmvnorm}{\mathsf{Normal}} \newcommand{\dgamma}{\mathsf{Gamma}} \newcommand{\dinvgamma}{\mathsf{InvGamma}} \newcommand{\dchisq}{\mathsf{ChiSquared}} \newcommand{\dinvchisq}{\mathsf{InvChiSquared}} \newcommand{\dexp}{\mathsf{Exponential}} \newcommand{\dlaplace}{\mathsf{Laplace}} \newcommand{\dweibull}{\mathsf{Weibull}} \newcommand{\dwishart}{\mathsf{Wishart}} \newcommand{\dinvwishart}{\mathsf{InvWishart}} \newcommand{\dlkj}{\mathsf{LkjCorr}} \newcommand{\dt}{\mathsf{StudentT}} \newcommand{\dhalft}{\mathsf{HalfStudentT}} \newcommand{\dbeta}{\mathsf{Beta}} \newcommand{\ddirichlet}{\mathsf{Dirichlet}} \newcommand{\dlogistic}{\mathsf{Logistic}} \newcommand{\dllogistic}{\mathsf{LogLogistic}} \newcommand{\dpois}{\mathsf{Poisson}} \newcommand{\dBinom}{\mathsf{Binomial}} \newcommand{\dmultinom}{\mathsf{Multinom}} \newcommand{\dnbinom}{\mathsf{NegativeBinomial}} \newcommand{\dnbinomalt}{\mathsf{NegativeBinomial2}} \newcommand{\dbetabinom}{\mathsf{BetaBinomial}} \newcommand{\dcauchy}{\mathsf{Cauchy}} \newcommand{\dhalfcauchy}{\mathsf{HalfCauchy}} \newcommand{\dbernoulli}{\mathsf{Bernoulli}} \newcommand{\R}{\mathbb{R}} \newcommand{\Reals}{\R} \newcommand{\RealPos}{\R^{+}} \newcommand{\N}{\mathbb{N}} \newcommand{\Nats}{\N} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator{\invlogit}{Inv-Logit} \DeclareMathOperator{\diag}{diag} \]

9 Model Checking

9.1 Why check models?

  • In theory, a Bayesian model should include all relevant substantive knowledge and subsume all possible theories.
  • In practice, it won’t. We need to check how the model fits data.

The question is not whether a model is “true”; it isn’t (Box 1976). The question is whether it is good enough for the purposes of the analysis. The problem is how we can specify “good enough” criteria, and how we can check those criteria.

See Gelman, Meng, and Stern (1996), A. Gelman (2007), Gelman (2009), A. Gelman, Carlin, et al. (2013, Ch. 6), A. Gelman and Shalizi (2012a, 2012b), Kruschke (2013), and Gelman (2014) for more discussion of the motivation and use of posterior predictive checks.

9.2 Posterior Predictive Checks

One method to evaluate the fit of a model is to use posterior predictive checks:

  • Fit the model to the data to get the posterior distribution of the parameters: \(p(\theta | D)\)
  • Simulate data from the fitted model: \(p(\tilde{D} | \theta, D)\)
  • Compare the simulated data (or statistics thereof) to the observed data (or the same statistics). The comparison can be formal or visual.

Within a Stan program, this is done in the generated quantities block using the _rng distribution functions.
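
Outside of Stan, the same simulation can be done in R from the posterior draws. A minimal sketch for a simple normal model, using fake draws in place of an actual fit (the vectors mu and sigma stand in for posterior draws of the parameters):

    # Stand-ins for posterior draws (in practice, extract them from the fit,
    # e.g., with rstan::extract()).
    set.seed(123)
    n_sims <- 1000
    n <- 50
    y <- rnorm(n, 1, 2)                      # observed data (illustrative)
    mu <- rnorm(n_sims, mean(y), 0.3)        # posterior draws of the mean
    sigma <- abs(rnorm(n_sims, sd(y), 0.2))  # posterior draws of the sd

    # One replicated dataset per posterior draw: a draw from p(yrep | theta^(s))
    yrep <- matrix(NA_real_, n_sims, n)
    for (s in seq_len(n_sims)) {
      yrep[s, ] <- rnorm(n, mu[s], sigma[s])
    }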

The package bayesplot includes multiple functions for posterior predictive checks; see the help for PPC-overview for a summary of these functions.
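
For example, assuming the y vector and yrep matrix from the sketch above (rows are draws, columns are observations), two common checks are:

    library(bayesplot)

    # Overlay the density of y on the densities of a subset of replicated datasets
    ppc_dens_overlay(y, yrep[1:50, ])

    # Compare an observed statistic to its posterior predictive distribution
    ppc_stat(y, yrep, stat = "mean")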

9.2.1 Bayesian p-values

A posterior predictive p-value is the tail posterior probability of a statistic generated from the model, compared with the statistic observed in the data. Let \(y = (y_1, \dots, y_n)\) be the observed data. Suppose the model has been fit and there is a set of simulations \(\theta^{(s)}\), \(s = 1, \dots, n_{\text{sims}}\). Each replicated dataset \(y^{\text{rep}(s)}\) is generated from the predictive distribution of the data, \(p(y^{\text{rep}} | \theta = \theta^{(s)})\). Then the ensemble of simulated datasets, \((y^{\text{rep}(1)}, \dots, y^{\text{rep}(n_{\text{sims}})})\), is a sample from the posterior predictive distribution, \(p(y^{\text{rep}} | y)\).

The model can be tested by means of discrepancy statistics, which are functions of the data and parameters, \(T(y, \theta)\). If \(\theta\) were known, the observed discrepancy \(T(y, \theta)\) could be compared directly with the replicated discrepancy \(T(y^{\text{rep}}, \theta)\), with tail probability \(p = \Pr(T(y^{\text{rep}}, \theta) > T(y, \theta) | y, \theta)\). Since \(\theta\) is unknown, average over its posterior distribution, \[ \begin{aligned}[t] p &= \Pr(T(y^{\text{rep}}, \theta) > T(y, \theta) | y) \\ &= \int \Pr(T(y^{\text{rep}}, \theta) > T(y, \theta) | y, \theta) p(\theta | y) \,d\theta , \end{aligned} \] which is easily estimated from the MCMC samples as \[ p = \frac{1}{n_{\text{sims}}} \sum_{s = 1}^{n_{\text{sims}}} 1 \left( T(y^{\text{rep}(s)}, \theta^{(s)}) > T(y, \theta^{(s)}) \right) . \]
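
This estimate is a one-line computation in R. A minimal sketch, reusing y and yrep from the running example above, with the sample standard deviation as the test quantity (here \(T\) depends only on the data, so \(\theta^{(s)}\) drops out):

    T_obs <- sd(y)                  # discrepancy in the observed data
    T_rep <- apply(yrep, 1, sd)     # discrepancy in each replicated dataset
    p_value <- mean(T_rep > T_obs)  # proportion of replications exceeding T_obs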

9.2.2 Test quantities

The definition of a posterior predictive p-value does not specify a particular test statistic \(T\) to use; the best advice is that the choice of \(T\) depends on the application.

  • The speed of light example in A. Gelman, Carlin, et al. (2013, 146) uses the width of the sample's 90% interval (the difference between the 61st and 6th order statistics).

  • The binomial trial example in A. Gelman, Carlin, et al. (2013, 147) uses the number of switches (0 to 1, or 1 to 0) in the sequence in order to test independence; a sketch of this statistic appears after this list.

  • The hierarchical model for adolescent smoking in A. Gelman, Carlin, et al. (2013, 148) uses:

    • the percentage of adolescents in the sample who never smoked,
    • the percentage in the sample who smoked in all waves,
    • the percentage of “incident smokers”: adolescents who began the study as non-smokers and ended it as smokers.
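
A sketch of the switches statistic from the binomial trial example, applied to an illustrative binary sequence (not the data from the original example):

    # Number of switches (0 to 1 or 1 to 0) in a binary sequence
    n_switches <- function(x) sum(diff(x) != 0)

    y_seq <- c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1)  # illustrative sequence
    n_switches(y_seq)                          # 2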

9.2.3 p-values vs. u-values

A posterior predictive p-value is different from a classical p-value.

  • Posterior predictive \(p\)-value

    • a posterior probability statement about the data, conditional on the model; it is not, in general, uniformly distributed even when the model is true.
  • Classical \(p\)-value

    • uniformly distributed if the null hypothesis (\(H_0\)) is true.

A \(u\)-value is any function of the data that has a \(U(0, 1)\) sampling distribution (A. Gelman, Carlin, et al. 2013, 151).

  • a \(u\)-value can be averaged over \(\theta\), but it is not Bayesian: it is not a posterior probability statement
  • a posterior predictive p-value is a probability statement, conditional on the model and data, about future observations

9.2.4 Marginal predictive checks

Marginal predictive checks compare statistics for each observation individually, rather than for the dataset as a whole.

Conditional Predictive Ordinate (CPO): The CPO (Gelfand 1995) is the leave-one-out cross-validation predictive density: \[ p(y_i | y_{-i}) = \int p(y_i | \theta) p(\theta | y_{-i}) \,d\theta . \] The pointwise LOO predictive densities can be approximated using PSIS-LOO (or WAIC) in the loo package.
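
A minimal sketch with the loo package, continuing the running normal example: build the pointwise log-likelihood matrix, run PSIS-LOO, and exponentiate the pointwise LOO log predictive densities to get the CPO values:

    library(loo)

    # log p(y_i | theta^(s)): draws in rows, observations in columns
    log_lik <- sapply(seq_along(y), function(i) dnorm(y[i], mu, sigma, log = TRUE))

    loo_fit <- loo(log_lik)  # PSIS-LOO (r_eff omitted for brevity)
    cpo <- exp(loo_fit$pointwise[, "elpd_loo"])  # CPO_i ~ p(y_i | y_-i)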

Predictive Concordance and Predictive Quantiles: Gelfand (1995) classifies any \(y_i\) that falls outside the central 95% posterior predictive interval of \(y^{\text{rep}}_i\) as an outlier. Let the predictive quantile be \[ PQ_i = \Pr(y_i^{\text{rep}} > y_i) . \] Then the predictive concordance is the proportion of \(y_i\) that are not outliers. Gelfand (1995) argues that the predictive concordance should be approximately 95%; in other words, the posterior predictive distribution should have the correct coverage.
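
Continuing the running example, the predictive quantiles and the predictive concordance can be computed directly from yrep:

    # PQ_i = Pr(yrep_i > y_i), estimated by the proportion of draws above y_i
    pq <- colMeans(sweep(yrep, 2, y, FUN = ">"))

    # Outliers: observations outside the central 95% predictive interval
    outlier <- pq < 0.025 | pq > 0.975

    # Predictive concordance: proportion of observations that are not outliers
    concordance <- mean(!outlier)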

9.2.5 Outliers

Outliers can be identified using the inverse CPO, \(1 / \text{CPO}_i\); see the sketch after this list.

  • Ntzoufras (2009, 376) considers observations with inverse-CPO values larger than 40 to be possible outliers, and those higher than 70 to be extreme values.
  • Congdon (2014) scales the CPO values by dividing each by the maximum CPO in the sample, and considers observations with scaled CPO under 0.01 to be outliers.
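
Continuing with the cpo vector from the loo sketch above:

    inv_cpo <- 1 / cpo
    which(inv_cpo > 40)           # possible outliers (Ntzoufras 2009)
    which(inv_cpo > 70)           # extreme values

    scaled_cpo <- cpo / max(cpo)  # Congdon (2014)
    which(scaled_cpo < 0.01)      # outliers under Congdon's rule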

9.2.6 Graphical Posterior Predictive Checks

Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you. – paraphrase of Hadley Wickham

Instead of calculating posterior probabilities, plot simulated data and observed data and visually compare them. See A. Gelman, Carlin, et al. (2013, 154).

  • plot simulated data and real data (A. Gelman, Carlin, et al. 2013, 154). This is similar to ideas in Wickham et al. (2010).

  • plot summary statistics or inferences

  • residual plots

    • Bayesian residuals have a posterior distribution: each posterior draw yields a realization \(r_i^{(s)} = y_i - \E(y_i | \theta^{(s)})\)

    • a Bayesian residual graph plots a single realization of the residuals, or a summary of their posterior distributions

    • binned residual plots are best for discrete data (A. Gelman, Carlin, et al. 2013, 157); see the sketch after this list
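
A sketch of residual-style graphical checks with bayesplot, reusing y and yrep from the running example; the predictive errors \(y - y^{\text{rep}}\) stand in for the residuals:

    library(bayesplot)

    # Histograms of predictive errors for a few realizations
    ppc_error_hist(y, yrep[1:4, ])

    # Average predictive error against each observation
    ppc_error_scatter_avg(y, yrep)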

9.3 References

See A. Gelman and Shalizi (2012a, 2012b) and Kruschke (2013).