\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\median}{median} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\logistic}{Logistic} \DeclareMathOperator{\logit}{Logit} \newcommand{\mat}[1]{\boldsymbol{#1}} \newcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} % This follows BDA \newcommand{\dunif}{\mathsf{Uniform}} \newcommand{\dnorm}{\mathsf{Normal}} \newcommand{\dhalfnorm}{\mathrm{HalfNormal}} \newcommand{\dlnorm}{\mathsf{LogNormal}} \newcommand{\dmvnorm}{\mathsf{Normal}} \newcommand{\dgamma}{\mathsf{Gamma}} \newcommand{\dinvgamma}{\mathsf{InvGamma}} \newcommand{\dchisq}{\mathsf{ChiSquared}} \newcommand{\dinvchisq}{\mathsf{InvChiSquared}} \newcommand{\dexp}{\mathsf{Exponential}} \newcommand{\dlaplace}{\mathsf{Laplace}} \newcommand{\dweibull}{\mathsf{Weibull}} \newcommand{\dwishart}{\mathsf{Wishart}} \newcommand{\dinvwishart}{\mathsf{InvWishart}} \newcommand{\dlkj}{\mathsf{LkjCorr}} \newcommand{\dt}{\mathsf{StudentT}} \newcommand{\dhalft}{\mathsf{HalfStudentT}} \newcommand{\dbeta}{\mathsf{Beta}} \newcommand{\ddirichlet}{\mathsf{Dirichlet}} \newcommand{\dlogistic}{\mathsf{Logistic}} \newcommand{\dllogistic}{\mathsf{LogLogistic}} \newcommand{\dpois}{\mathsf{Poisson}} \newcommand{\dBinom}{\mathsf{Binomial}} \newcommand{\dmultinom}{\mathsf{Multinom}} \newcommand{\dnbinom}{\mathsf{NegativeBinomial}} \newcommand{\dnbinomalt}{\mathsf{NegativeBinomial2}} \newcommand{\dbetabinom}{\mathsf{BetaBinomial}} \newcommand{\dcauchy}{\mathsf{Cauchy}} \newcommand{\dhalfcauchy}{\mathsf{HalfCauchy}} \newcommand{\dbernoulli}{\mathsf{Bernoulli}} \newcommand{\R}{\mathbb{R}} \newcommand{\Reals}{\R} \newcommand{\RealPos}{\R^{+}} \newcommand{\N}{\mathbb{N}} \newcommand{\Nats}{\N} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator{\invlogit}{Inv-Logit} \DeclareMathOperator{\logit}{Logit} \DeclareMathOperator{\diag}{diag} \]

Appendix

Prerequisites

library("tidyverse")
library("stringr")
library("bayz")

20.14 Parameters

Category               Description
modeled data           Data assigned a probability distribution
unmodeled data         Data not given a distribution
modeled parameters     Parameters with an informative prior distribution
unmodeled parameters   Parameters with a non-informative prior distribution
derived quantities     Variables defined deterministically from other quantities

See A. Gelman and Hill (2007, 366)

20.15 Miscellaneous Mathematical Background

20.15.1 Location-Scale Families

In a location-scale family of distributions, if the random variable \(X\) is distributed with mean 0 and standard deviation 1, then the random variable \(Y\), \[ Y = \mu + \sigma X , \] has mean \(\mu\) and standard deviation \(\sigma\).

Normal distribution: Suppose \(X \sim \dnorm(0, 1)\), then \[ Y = \mu + \sigma X, \] is equivalent to \(Y \sim \dnorm(\mu, \sigma)\) (normal with mean \(\mu\) and standard deviation \(\sigma\)).

Student-t distribution (including Cauchy): \[ \begin{aligned}[t] X &\sim \dt_{\nu}(0, 1) \\ Y &= \mu + \sigma X \end{aligned} \] implies \[ Y \sim \dt_{\nu}(\mu, \sigma), \] i.e. \(Y\) is distributed Student-\(t\) with location \(\mu\) and scale \(\sigma\).
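
As a quick numerical check, here is a minimal R sketch (with arbitrary values of \(\nu\), \(\mu\), and \(\sigma\)): the sample quantiles of \(\mu + \sigma X\), where \(X\) is a standard Student-\(t\) random variable, match the quantiles implied by the location-scale transformation.

set.seed(1)
nu <- 4; mu <- 2; sigma <- 3
x <- rt(1e5, df = nu)                  # draws from StudentT(nu, 0, 1)
y <- mu + sigma * x                    # location-scale transformation
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
# sample quantiles of y vs. quantiles implied by mu + sigma * qt(...)
rbind(sample = quantile(y, probs),
      theory = mu + sigma * qt(probs, df = nu))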

In Stan, it can be useful to parameterize distributions in terms of mean-0, scale-1 parameters, with separate parameters for the location and scale. For example, with normal distributions,

data {
  int<lower = 0> n;
}
parameters {
  real mu;                   // location
  real<lower = 0.0> sigma;   // scale
  vector[n] eps;             // standardized (location 0, scale 1) parameters
}
transformed parameters {
  vector[n] y;
  y = mu + sigma * eps;      // implies y ~ normal(mu, sigma)
}
model {
  eps ~ normal(0.0, 1.0);
}

20.15.2 Scale Mixtures of Normal Distributions

Some commonly used distributions can be represented as scale mixtures of normal distributions. For formal details of scale mixtures of normal distributions, see West (1987). A distribution that is a scale mixture of normals can be written as, \[ \begin{aligned}[t] Y_i &\sim \dnorm(\mu, \sigma_i^2) , \\ \sigma_i &\sim \pi(\sigma_i) . \end{aligned} \] As the name suggests, the individual variances (scales) themselves have a distribution.

Some examples:

  • Student-t
  • Double Exponential
  • Horseshoe or Hierarchical Shrinkage (HS)
  • Horseshoe Plus or Hierarchical Shrinkage Plus (HS+)

Even when the analytic form of a distribution is available, representing it as a scale mixture of normal distributions can be convenient in modeling: it makes drawing samples from the distribution easy, and in HMC it may induce a more tractable posterior density.
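
For example, the Student-\(t\) distribution is a scale mixture of normals in which the scales are drawn as \(\sigma_i = \sqrt{\nu / W_i}\) with \(W_i \sim \dchisq(\nu)\). A minimal R sketch (with an arbitrary choice of degrees of freedom):

set.seed(1)
nu <- 5
n <- 1e5
sigma_i <- sqrt(nu / rchisq(n, df = nu))   # scales drawn from the mixing distribution
y <- rnorm(n, mean = 0, sd = sigma_i)      # normal draws, each with its own scale
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
# the mixture reproduces the quantiles of a standard Student-t with nu d.f.
rbind(mixture = quantile(y, probs),
      student_t = qt(probs, df = nu))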

20.15.3 Covariance-Correlation Matrix Decomposition

The suggested method for modeling covariance matrices in Stan is the separation strategy, which decomposes a covariance matrix \(\Sigma\) into a standard deviation vector \(\sigma\) and a correlation matrix \(R\) (Barnard, McCulloch, and Meng 2000), \[ \Sigma = \diag(\sigma) R \diag(\sigma) . \] This is useful for setting priors on covariance because separate priors can be set for the scales of the variables, via \(\sigma\), and the correlations between them, via \(R\).
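
A minimal R check of this decomposition, using an arbitrary \(2 \times 2\) covariance matrix:

Sigma <- matrix(c(4, 1.2, 1.2, 1), 2, 2)             # example covariance matrix
sigma <- sqrt(diag(Sigma))                           # standard deviation vector
R <- cov2cor(Sigma)                                  # correlation matrix
all.equal(diag(sigma) %*% R %*% diag(sigma), Sigma)  # TRUE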

The rstanarm decov prior goes further and decomposes the covariance matrix into a correlation matrix, \(\mat{R}\), a diagonal variance matrix \(\mat{\Omega}\) with trace \(n \sigma^2\), a scalar global variance \(\sigma^2\), and a simplex \(\vec{\pi}\) (proportion of total variance for each variable): \[ \begin{aligned}[t] \mat{\Sigma} &= \mat{\Omega} \mat{R} \\ \diag(\mat{\Omega}) &= n \vec{\pi} \sigma^2 \end{aligned} \] Separate and interpretable priors can be put on \(\mat{R}\), \(\vec{\pi}\), and \(\sigma^2\).

The LKJ distribution (Lewandowski, Kurowicka, and Joe 2009) is a distribution over correlation matrices, \[ R \sim \dlkj(\eta) , \] where \[ \dlkj(R | \eta) \propto \det(R)^{(\eta - 1)} . \]

This distribution has the following properties:

  • \(\eta = 1\): the density is uniform over correlation matrices
  • \(\eta \to \infty\): the distribution concentrates around the identity matrix
  • \(0 < \eta < 1\): there is a trough at the identity matrix, with higher probability placed on non-zero correlations.
  • For all positive \(\eta\) (\(\eta > 0\)), \(\E(R) = \mat{I}\).
lkjcorr_df <- function(eta, n = 2) {
  # Draw a single n x n correlation matrix from the LKJ distribution with
  # shape eta (rlkjcorr() is assumed to be provided by an attached package),
  # and reshape it into a long data frame of (.row, .col, value).
  out <- as.data.frame(rlkjcorr(n, eta))
  out$.row <- seq_len(nrow(out))
  out <- gather(out, .col, value, -.row)
  out$.col <- as.integer(str_replace(out$.col, "^V", ""))
  out$eta <- eta
  out
}

lkjsims <- purrr::map_df(c(0.01, 0.1, 1, 2, 50, 1000), lkjcorr_df, n = 50)

This simulates a single 50 × 50 correlation matrix from the LKJ distribution for each value of \(\eta\). As \(\eta \to \infty\), the off-diagonal correlations shrink towards 0 and the correlation matrix approaches the identity matrix.

ggplot(lkjsims,
       aes(x = .row, y = .col, fill = value)) +
  facet_wrap(~ eta, ncol = 2) +
  scale_fill_distiller(limits = c(-1, 1), type = "div", palette = "RdYlBu") +
  geom_raster() +
  theme_minimal() +
  theme(panel.grid = element_blank(), axis.text = element_blank()) +
  labs(x = "", y = "")

The densities of the off-diagonal correlations for each value of \(\eta\):

lkjsims %>%
  filter(.row < .col) %>%
  ggplot(aes(x = value, colour = factor(eta))) +
  geom_density()

For further discussion of the LKJ correlation distribution, see Stan Development Team (2016).

20.15.4 QR Factorization

For a full-rank \(N \times K\) matrix \(\mat{X}\), the QR factorization is \[ \mat{X} = \mat{Q} \mat{R}, \] where \(\mat{Q}\) is an orthonormal matrix such that \(\mat{Q}\T \mat{Q} = \mat{I}\), and \(\mat{R}\) is an upper triangular matrix.

The Stan manual (Stan Development Team 2016) suggests working with the scaled factors \[ \begin{aligned}[t] \mat{Q}^* &= \mat{Q} \times \sqrt{N - 1} , \\ \mat{R}^* &= \frac{1}{\sqrt{N - 1}} \mat{R} . \end{aligned} \]

This factorization is useful for fitting linear models.

Suppose \(\vec{\beta}\) is a \(K \times 1\) vector of coefficients; then \[ \vec{\eta} = \mat{X} \vec{\beta} = \mat{Q} \mat{R} \vec{\beta} = \mat{Q}^* \mat{R}^* \vec{\beta} . \] Let \(\vec{\theta} = \mat{R}^* \vec{\beta}\); then \(\vec{\eta} = \mat{Q}^* \vec{\theta}\) and \(\vec{\beta} = {\mat{R}^*}^{-1} \vec{\theta}\).
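
These identities can be checked in base R with simulated data (the matrix \(\mat{X}\) and coefficient vector \(\vec{\beta}\) below are arbitrary):

set.seed(1)
N <- 100; K <- 3
X <- matrix(rnorm(N * K), N, K)
qr_x <- qr(X)
Q_star <- qr.Q(qr_x) * sqrt(N - 1)
R_star <- qr.R(qr_x) / sqrt(N - 1)
all.equal(Q_star %*% R_star, X)                        # X = Q* R*
beta <- c(1, -2, 0.5)
theta <- drop(R_star %*% beta)                         # theta = R* beta
all.equal(drop(Q_star %*% theta), drop(X %*% beta))    # eta = Q* theta = X beta
all.equal(drop(backsolve(R_star, theta)), beta)        # beta = (R*)^{-1} theta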

rstanarm provides a prior for a normal linear model which uses the QR decomposition to parameterize a prior in terms of \(R^2\).

Stan functions:

  • qr_Q(matrix A)
  • qr_R(matrix A)

See Stan Development Team (2016 Sec 8.2)

20.15.5 Cholesky Decomposition

The Cholesky decomposition of a positive definite matrix \(A\) is, \[ \mat{A} = \mat{L} \mat{L}\T , \] where \(\mat{L}\) is a lower-triangular matrix.

  • It is similar to a square root for a matrix.
  • It is often more numerically stable and efficient to work with the Cholesky decomposition than with the covariance matrix itself. When working with the covariance matrix directly, limited numerical precision can produce a matrix that is not positive definite, whereas working with \(\mat{L}\) ensures that \(\mat{A} = \mat{L} \mat{L}\T\) is positive definite.

  • In Stan

    • The types cholesky_factor_cov and cholesky_factor_corr represent the Cholesky factor of a covariance matrix and of a correlation matrix, respectively.
    • The Cholesky decomposition function is cholesky_decompose(matrix A).
  • Multiple functions in Stan are parameterized with Cholesky decompositions instead of or in addition to covariance matrices. Use them if possible; they are more numerically stable.

    • lkj_corr_cholesky_lpdf
    • multi_normal_cholesky_lpdf

The Cholesky factor is used for sampling from a multivariate normal distribution using i.i.d. standard normal random variables. Suppose \(X_1, \dots, X_N\) are i.i.d. standard normal random variables, \(\mat{\Omega}\) is an \(N \times N\) lower-triangular matrix such that \(\mat{\Omega} \mat{\Omega}\T = \mat{\Sigma}\), and \(\vec{\mu}\) is an \(N \times 1\) vector. Then, with \(\vec{X} = (X_1, \dots, X_N)\T\), \[ \vec{\mu} + \mat{\Omega} \vec{X} \sim \dnorm(\vec{\mu}, \mat{\Sigma}) . \]
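
A minimal R sketch of this construction (with example values of \(\vec{\mu}\) and \(\mat{\Sigma}\)):

set.seed(1)
Sigma <- matrix(c(1, 0.8, 0.8, 2), 2, 2)   # example covariance matrix
mu <- c(1, -1)
L <- t(chol(Sigma))                        # chol() returns the upper factor; transpose to get L
Z <- matrix(rnorm(2 * 1e4), nrow = 2)      # columns are i.i.d. standard normal vectors
Y <- mu + L %*% Z                          # each column is a draw from Normal(mu, Sigma)
rowMeans(Y)                                # approximately mu
cov(t(Y))                                  # approximately Sigma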

See Stan Development Team (2016, 40, 147, 241, 246)

20.16 Scaled and Unscaled Variables

Though priors shouldn’t depend on the data itself, weakly informative priors are often most naturally specified on the scale of standardized variables. The relationships below convert coefficients and priors between the standardized scale and the original data scale.

Suppose \(\tilde{Y}\), \(\tilde{X}\), \(\tilde{\alpha}\), \(\tilde{\beta}\), and \(\tilde{\epsilon}\) are random variables such that \[ \tilde{Y} = \tilde{\alpha} + \tilde{\beta} \tilde{X} + \tilde{\epsilon} . \] These random variables have the following properties, \[ \begin{aligned}[t] \tilde{Y} &= \frac{Y - \bar{Y}}{\sigma_Y}, & \E[\tilde{Y}] &= 0, & \Var[\tilde{Y}] &= 1 , \\ \tilde{X} &= \frac{X - \bar{X}}{\sigma_X}, & \E[\tilde{X}] &= 0, & \Var[\tilde{X}] &= 1 , \\ && \E[\tilde{\epsilon}] &= 0 , & \Var[\tilde{\epsilon}] &= \sigma_{\tilde{\epsilon}}^2 , \end{aligned} \] where \[ \begin{aligned}[t] \bar{X} &= \E[X] , & \sigma_X^2 &= \Var[X] , \\ \bar{Y} &= \E[Y] , & \sigma_Y^2 &= \Var[Y] . \end{aligned} \]

Then via some algebra, \[ \begin{aligned} Y &= \underbrace{\sigma_{Y} \tilde{\alpha} + \bar{Y} - \frac{\sigma_Y }{\sigma_X} \tilde{\beta} \bar{X}}_{\alpha} + \underbrace{\frac{\sigma_Y}{\sigma_X} \tilde{\beta}}_{\beta} X + \underbrace{\sigma_Y \tilde{\epsilon}}_{\epsilon} \\ &= \alpha + \beta X + \epsilon . \end{aligned} \] The primary relationships of interest are those between \(\alpha\) and \(\tilde{\alpha}\), \(\beta\) and \(\tilde{\beta}\), and \(\epsilon\) and \(\tilde{\epsilon}\). These can be used to convert coefficients estimated with standardized data to coefficients on the data scale, or to adjust scale-free weakly informative priors to the data scale. \[ \begin{aligned}[t] \tilde{\alpha} &= \sigma_Y^{-1}\left(\alpha - \bar{Y} + \beta \bar{X} \right) \\ &= \sigma_Y^{-1}\left(\alpha - \bar{Y} + \frac{\sigma_Y}{\sigma_X} \tilde{\beta} \bar{X} \right), \\ \alpha &= \sigma_Y \tilde{\alpha} + \bar{Y} - \frac{\sigma_Y}{\sigma_X} \tilde{\beta} \bar{X} \\ &= \sigma_Y \tilde{\alpha} + \bar{Y} - \beta \bar{X} , \\ \tilde{\beta} &= \frac{\sigma_X}{\sigma_Y} \beta , \\ \beta &= \frac{\sigma_Y}{\sigma_X} \tilde{\beta} , \\ \tilde{\epsilon} &= \epsilon / \sigma_Y , \\ \epsilon &= \sigma_Y \tilde{\epsilon} . \end{aligned} \] This implies the following relationships between their means and variances, \[ \begin{aligned}[t] E(\alpha) &= \sigma_{Y} E(\tilde{\alpha}) + \bar{Y} - \frac{\sigma_Y}{\sigma_X} E(\tilde{\beta}) \bar{X} , & V(\alpha) &= \sigma_{Y}^2 V(\tilde{\alpha}) + \frac{\sigma_Y^2}{\sigma_X^2} V(\tilde{\beta}) \bar{X}^2 , \\ &= \sigma_{Y} E(\tilde{\alpha}) + \bar{Y} - \bar{X} E(\beta) , & &= \sigma_{Y}^2 V(\tilde{\alpha}) + \bar{X}^2 V(\beta) , \\ E(\tilde{\alpha}) &= \frac{E(\alpha) - \bar{Y} + E(\beta) \bar{X} }{\sigma_Y} , & V(\tilde{\alpha}) &= \sigma_Y^{-2} \left[ V(\alpha) + \bar{X}^2 V(\beta) \right] \end{aligned} \] For example, a weakly informative prior on \(\tilde{\alpha}\) implies a prior on \(\alpha\), \[ \tilde{\alpha} \sim N(0, 10^2) \to \alpha \sim N \left( \bar{Y} - \beta \bar{X}, \sigma_Y^2 10^2 \right) . \]

\[ \begin{aligned}[t] E(\beta) &= \frac{\sigma_Y}{\sigma_X} E(\tilde{\beta}) , & V(\beta) &= \frac{\sigma_Y^2}{\sigma_X^2} V(\tilde{\beta}) , \\ E(\tilde{\beta}) &= \frac{\sigma_X}{\sigma_Y} E(\beta) , & V(\tilde{\beta}) &= \frac{\sigma_X^2}{\sigma_Y^2} V(\beta) . \end{aligned} \]

For example, a weakly informative prior on \(\tilde{\beta}\) implies the following prior on \(\beta\), \[ \tilde{\beta} \sim N(0, 2.5^2) \to \beta \sim N\left(0, \frac{\sigma_Y^2}{\sigma_X^2} 2.5^2 \right) . \]

\[ \begin{aligned}[t] E(\epsilon) &= 0 , & V(\epsilon) &= \sigma_Y^2 V(\tilde{\epsilon}), \\ E(\tilde{\epsilon}) &= 0 , & V(\tilde{\epsilon}) &= \sigma_Y^{-2} V(\epsilon) . \end{aligned} \] For example, a weakly informative prior on the variance of \(\tilde{\epsilon}\) implies a weakly informative prior on the variance of \(\epsilon\), \[ \sigma_{\tilde{\epsilon}} \sim C^{+}\left(0, 5 \right) \to \sigma_{\epsilon} \sim C^{+}\left(0, 5 \sigma_Y \right) . \]
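
A minimal R sketch with simulated data illustrating these conversions: fitting a linear model to standardized variables and converting the coefficients back to the data scale recovers the coefficients from the unstandardized fit.

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)
y <- 1 + 0.5 * x + rnorm(100)
x_tilde <- (x - mean(x)) / sd(x)           # standardized predictor
y_tilde <- (y - mean(y)) / sd(y)           # standardized outcome
fit_std <- lm(y_tilde ~ x_tilde)
alpha_tilde <- coef(fit_std)[["(Intercept)"]]
beta_tilde <- coef(fit_std)[["x_tilde"]]
# convert to the data scale using the relationships above
beta <- sd(y) / sd(x) * beta_tilde
alpha <- sd(y) * alpha_tilde + mean(y) - beta * mean(x)
c(alpha = alpha, beta = beta)
coef(lm(y ~ x))                            # matches the converted coefficients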