\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\mean}{mean} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Bias}{Bias} \DeclareMathOperator{\MSE}{MSE} \DeclareMathOperator{\RMSE}{RMSE} \DeclareMathOperator{\sd}{sd} \DeclareMathOperator{\se}{se} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator{\median}{median} \DeclareMathOperator{\rank}{rank} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator{\logistic}{Logistic} \DeclareMathOperator{\logit}{Logit} \newcommand{\mat}[1]{\boldsymbol{#1}} \newcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\T}{'} % This follows BDA \newcommand{\dunif}{\mathsf{Uniform}} \newcommand{\dnorm}{\mathsf{Normal}} \newcommand{\dhalfnorm}{\mathrm{HalfNormal}} \newcommand{\dlnorm}{\mathsf{LogNormal}} \newcommand{\dmvnorm}{\mathsf{Normal}} \newcommand{\dgamma}{\mathsf{Gamma}} \newcommand{\dinvgamma}{\mathsf{InvGamma}} \newcommand{\dchisq}{\mathsf{ChiSquared}} \newcommand{\dinvchisq}{\mathsf{InvChiSquared}} \newcommand{\dexp}{\mathsf{Exponential}} \newcommand{\dlaplace}{\mathsf{Laplace}} \newcommand{\dweibull}{\mathsf{Weibull}} \newcommand{\dwishart}{\mathsf{Wishart}} \newcommand{\dinvwishart}{\mathsf{InvWishart}} \newcommand{\dlkj}{\mathsf{LkjCorr}} \newcommand{\dt}{\mathsf{StudentT}} \newcommand{\dhalft}{\mathsf{HalfStudentT}} \newcommand{\dbeta}{\mathsf{Beta}} \newcommand{\ddirichlet}{\mathsf{Dirichlet}} \newcommand{\dlogistic}{\mathsf{Logistic}} \newcommand{\dllogistic}{\mathsf{LogLogistic}} \newcommand{\dpois}{\mathsf{Poisson}} \newcommand{\dBinom}{\mathsf{Binomial}} \newcommand{\dmultinom}{\mathsf{Multinom}} \newcommand{\dnbinom}{\mathsf{NegativeBinomial}} \newcommand{\dnbinomalt}{\mathsf{NegativeBinomial2}} \newcommand{\dbetabinom}{\mathsf{BetaBinomial}} \newcommand{\dcauchy}{\mathsf{Cauchy}} \newcommand{\dhalfcauchy}{\mathsf{HalfCauchy}} \newcommand{\dbernoulli}{\mathsf{Bernoulli}} \newcommand{\R}{\mathbb{R}} 
\newcommand{\Reals}{\R} \newcommand{\RealPos}{\R^{+}} \newcommand{\N}{\mathbb{N}} \newcommand{\Nats}{\N} \newcommand{\cia}{\perp\!\!\!\perp} \DeclareMathOperator*{\plim}{plim} \DeclareMathOperator{\invlogit}{Inv-Logit} \DeclareMathOperator{\diag}{diag} \]

6 Estimation

6.1 Point Estimates

Bayesian point estimators follow this recipe:

  1. define a loss function that penalizes guesses
  2. take the expected value of that loss function over the distribution of the parameter of interest
  3. choose the estimate that minimizes that expected loss

Let \(\theta\) be a parameter with a prior distribution \(\pi\). Let \(L(\theta, \hat{\theta})\) be a loss function. Examples of loss functions include

  • squared error: \((\theta - \hat{\theta})^2\)
  • absolute error: \(|\theta - \hat{\theta}|\)

Let \(\hat{\theta}(x)\) be an estimator. The Bayes risk of \(\hat{\theta}\) is the expected value of the loss function over the probability distribution of \(\theta\), \[ \E_{\pi}(L(\theta, \hat{\theta})) = \int L(\theta, \hat{\theta}) \pi(\theta) \,d\theta . \]

An estimator is a Bayes estimator if it minimizes the Bayes risk over all estimators.

Estimator Loss Function
Mean \((\theta - \hat{\theta})^2\)
Median \(|\theta - \hat{\theta}|\)
\(p\)-Quantile \(\begin{cases} p | \theta - \hat{\theta} | & \text{for } \theta - \hat{\theta} \geq 0 \\ (1 - p) |\theta - \hat{\theta} | & \text{for } \theta - \hat{\theta} < 0 \end{cases}\)
Mode \(\begin{cases} 0 & \text{for } | \theta - \hat{\theta} | < \epsilon \\ 1 & \text{for } |\theta - \hat{\theta}| \geq \epsilon \end{cases}\)
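These correspondences can be checked numerically. The sketch below uses a Gamma sample as a stand-in for posterior draws, evaluates the average squared and absolute losses over a grid of guesses, and confirms that they are minimized near the sample mean and median, respectively (the grid, sample, and seed are arbitrary choices for illustration).

```r
# Sketch: the mean minimizes expected squared loss, the median
# minimizes expected absolute loss, over a set of draws.
set.seed(123)
theta <- rgamma(10000, shape = 2, rate = 1)  # stand-in "posterior" draws

expected_loss <- function(est, draws, loss) mean(loss(draws - est))
grid <- seq(min(theta), max(theta), length.out = 1000)

sq_loss  <- sapply(grid, expected_loss, draws = theta, loss = function(d) d^2)
abs_loss <- sapply(grid, expected_loss, draws = theta, loss = abs)

grid[which.min(sq_loss)]   # close to mean(theta)
grid[which.min(abs_loss)]  # close to median(theta)
```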

The posterior mode can often be estimated directly by maximizing the posterior density, \[ \hat{\theta}_{\text{mode}} = \arg \max_{\theta} p(\theta | y) . \] This is called maximum a posteriori (MAP) estimation.
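As a minimal sketch of MAP estimation, suppose the posterior were a Beta(3, 5) density (a hypothetical choice), whose mode is \((3 - 1) / (3 + 5 - 2) = 1/3\); numerically maximizing the log density recovers it:

```r
# Hypothetical Beta(3, 5) posterior; its analytic mode is 1/3.
log_post <- function(theta) dbeta(theta, 3, 5, log = TRUE)
optimize(log_post, interval = c(0, 1), maximum = TRUE)$maximum
# approximately 1/3
```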

Given any loss function, the corresponding Bayes estimate is found by minimizing the expected loss, \[ \hat\theta = \arg \min_{\theta^*} \int L(\theta, \theta^*) p(\theta) \,d\theta . \] This still requires integrating over the distribution of \(\theta\). When the form of \(p(\theta)\) is known, the integral may have a closed form; for most posterior distributions, it must be approximated, for example by averaging over posterior draws.


6.2 Credible Intervals

A \(p \times 100\%\) credible interval of a parameter \(\theta\) with distribution \(f(\theta)\) is an interval \((a, b)\) such that, \[ CI(\theta | p) = (a, b) \text{ s.t. } \int_{a}^{b} f(\theta) \,d\theta = p. \]

The credible interval is not uniquely defined: multiple intervals can satisfy the definition. The most common choices are the

  • equal-tailed interval: \((F^{-1}((1 - p) / 2), F^{-1}(1 - (1 - p) / 2))\), where \(F^{-1}\) is the quantile function of \(\theta\). The 95% credible interval uses the 2.5% and 97.5% quantiles.

  • highest posterior density (HPD) interval: the shortest credible interval. Unless the distribution is unimodal and symmetric, the HPD interval differs from the equal-tailed interval.

Generally it is fine to use the equal-tailed interval. This is what Stan reports by default.

The HPD interval

  • is harder to calculate.

  • may not be a single interval when the distribution is multimodal. (Though that may be a desirable feature.)

  • is, unlike the equal-tailed interval, not invariant under monotone transformations. For a monotone function \(g\), if \(CI_{HPD}(\theta) = (a, b)\), then generally \(CI_{HPD}(g(\theta)) \neq (g(a), g(b))\); for the equal-tailed interval, \(CI(g(\theta)) = (g(a), g(b))\).
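To illustrate the difference, the sketch below compares the equal-tailed interval with the shortest interval containing 95% of the draws (a sample-based HPD) for a right-skewed Gamma sample; the shortest interval sits closer to zero and is narrower. The sample and seed are arbitrary choices for illustration.

```r
set.seed(123)
theta <- rgamma(10000, shape = 2, rate = 1)
p <- 0.95

# Equal-tailed interval: quantiles of the sample
quantile(theta, probs = c((1 - p) / 2, 1 - (1 - p) / 2))

# Shortest interval containing p of the sorted draws
x <- sort(theta)
n <- length(x)
k <- floor(p * n)
widths <- x[(k + 1):n] - x[1:(n - k)]
i <- which.min(widths)
c(x[i], x[i + k])  # narrower than the equal-tailed interval
```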

How to calculate?

For the equal-tailed credible interval:

  • If the quantile function is known, use that.
  • Otherwise, calculate the quantiles of the sample.

For example, the 95% credible interval for a standard normal distribution is,

p <- 0.95
qnorm(c( (1 - p) / 2, 1 - (1 - p) / 2))
#> [1] -1.96  1.96

The 95% credible interval for a sample drawn from a normal distribution is,

quantile(rnorm(100), probs = c( (1 - p) / 2, 1 - (1 - p) / 2))
#>  2.5% 97.5% 
#> -1.78  1.79

Several packages provide functions that calculate the HPD interval, for example coda::HPDinterval.
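For example (assuming the coda package is installed), the draws must first be converted to an mcmc object:

```r
library(coda)
theta <- rgamma(10000, shape = 2, rate = 1)
HPDinterval(as.mcmc(theta), prob = 0.95)
```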

6.2.1 Compared to confidence intervals

TODO

6.3 Bayesian Decision Theory

One aspect of Bayesian inference is that it separates inference from decision.

  1. Estimate a posterior distribution \(p(\theta | y)\)
  2. Define a loss function for an action (\(a\)) and parameter (\(\theta\)), \(L(a, \theta)\).
  3. Choose that action that minimizes the loss function

In this framework, inference is a subset of decisions and estimators are a subset of decision rules. Choosing an estimate is an action that aims to minimize the loss from guessing a parameter value.

Given \(\theta\) with distribution \(p(\theta)\) and a loss function \(L(a, \theta)\), the optimal action \(a^*\) from a set of actions \(\mathcal{A}\) is \[ a^* = \arg \min_{a \in \mathcal{A}} \int p(\theta) L(a, \theta) \,d \theta . \] If we only have a sample of size \(S\) from \(p(\theta)\) (as in a posterior distribution estimated by MCMC), the optimal decision is calculated from the Monte Carlo average loss: \[ a^* = \arg \min_{a \in \mathcal{A}} \frac{1}{S} \sum_{s = 1}^S L(a, \theta_s) . \]
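The sample-based version can be sketched as follows; the asymmetric loss here, which penalizes over-prediction three times as heavily as under-prediction, is a hypothetical example, as are the posterior draws and the candidate actions.

```r
# Sketch: choose among candidate actions by minimizing the
# Monte Carlo average loss over posterior draws.
set.seed(123)
theta <- rnorm(10000, mean = 1, sd = 2)  # stand-in posterior draws
actions <- seq(-2, 4, by = 0.1)          # candidate actions

# Hypothetical asymmetric loss: over-prediction penalized 3x
loss <- function(a, th) ifelse(a > th, 3 * (a - th), th - a)

avg_loss <- sapply(actions, function(a) mean(loss(a, theta)))
actions[which.min(avg_loss)]  # pulled below the posterior mean by the 3x penalty
```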

The introductions of the Talking Machines episodes The Church of Bayes and Collecting Data have concise discussions by Neil Lawrence of the pros and cons of Bayesian decision making.