1 Bayesian Inference
new shit has come to light, man – The Dude (The Big Lebowski)
Statistical inference is the process of using observed data to infer properties of the statistical distributions that generated that data.
Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations.
The motivation of statistical inference is to learn about unknown quantities (parameters) of a process from data generated by that process. In other words, the quantity of interest in statistical inference is \[ \Pr(\text{parameters} | \text{data}) . \] This conditional distribution is called the posterior distribution. Bayesian inference answers this question by appealing to Bayes’ Theorem, \[ \underbrace{\Pr(\text{parameters} | \text{data})}_{\text{posterior}} = \frac{\overbrace{\Pr(\text{data} | \text{parameters})}^{\text{likelihood}} \overbrace{\Pr(\text{parameters})}^{\text{prior}}}{\underbrace{\Pr(\text{data})}_{\text{evidence}}} . \]
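As a concrete illustration, here is a minimal grid-approximation sketch of Bayes’ Theorem in Python. The coin-flip data, the Beta(2, 2) prior, and all numbers are assumptions chosen for illustration, not an example from these notes.

```python
# Hypothetical example: infer the success probability theta of a coin
# from 7 heads in 10 flips, with an assumed Beta(2, 2) prior.
import numpy as np
from scipy import stats

n, k = 10, 7                               # data: 7 heads in 10 flips
theta = np.linspace(0, 1, 1001)            # grid over the parameter

prior = stats.beta.pdf(theta, 2, 2)        # Pr(parameters)
likelihood = stats.binom.pmf(k, n, theta)  # Pr(data | parameters)

unnormalized = likelihood * prior          # numerator of Bayes' Theorem
evidence = unnormalized.sum() * (theta[1] - theta[0])  # Pr(data), Riemann sum
posterior = unnormalized / evidence        # Pr(parameters | data)

print("posterior mean of theta:",
      (theta * posterior).sum() * (theta[1] - theta[0]))
```

The grid makes the roles of the prior, likelihood, and evidence explicit; in realistic models the same computation is done with conjugate formulas or sampling methods rather than a grid.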
The remainder of these notes discusses how to apply this to problems.
1.1 Bayesian Analysis
The three steps of Bayesian analysis (A. Gelman, Carlin, et al. 2013, 3) are
Modeling: define a full probability model that incorporates all observable and unobservable quantities of the problem. To the extent possible, the model should incorporate all relevant knowledge about the underlying problem and the data generating process.
Estimation: given the observed data, compute the posterior distribution of the parameters of the model defined in the modeling step.
Evaluation: given the posterior distribution, evaluate the fit of the model to the data, or its predictions of new data. If the fit is insufficient, go back to step one (see the sketch after this list for a minimal end-to-end example).
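The following sketch walks through the three steps for a deliberately simple, assumed example: a Normal model with known standard deviation and a conjugate Normal prior on the mean. It is illustrative only, not a model discussed in these notes.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, scale=2.0, size=50)   # stand-in for observed data

# 1. Modeling: y_i ~ Normal(mu, sigma^2) with sigma assumed known,
#    and a conjugate prior mu ~ Normal(m0, s0^2).
sigma, m0, s0 = 2.0, 0.0, 10.0

# 2. Estimation: the conjugate update gives the posterior of mu in closed form.
post_var = 1.0 / (1.0 / s0**2 + len(y) / sigma**2)
post_mean = post_var * (m0 / s0**2 + y.sum() / sigma**2)

# 3. Evaluation: a simple posterior predictive check. The mean of a replicated
#    data set of size n is Normal(mu, sigma^2 / n), so it can be drawn directly
#    for each posterior draw of mu and compared with the observed mean.
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)
rep_means = rng.normal(mu_draws, sigma / np.sqrt(len(y)))
print("Pr(replicated mean > observed mean) ~", np.mean(rep_means > y.mean()))
```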
One of the nice features of Bayesian analysis is that, in theory, it clearly separates the modeling step from the estimation step. In practice, however, the difficulty of computing the posterior distribution has meant that the modeling and computational steps have often been tightly coupled. Yet new algorithms, improvements in computational capacity, and new software have started to make it possible to treat the estimation stage as a black box.
1.2 Posterior Predictive Distribution
The posterior predictive distribution is the probability of observing new data (\(y^{eval}\)) given the posterior distribution of the model parameters after observing training data, \(p(\theta | y^{train})\). \[ p(y^{eval} | y^{train}) = \int p(y^{eval} | \theta) p(\theta | y^{train})\,d \theta . \tag{1.1} \]
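In practice the integral in Equation (1.1) is usually approximated by simulation: draw \(\theta\) from the posterior, then draw \(y^{eval}\) from \(p(y^{eval} | \theta)\). A minimal sketch, reusing the assumed Beta-Binomial example from above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, k_train = 10, 7     # assumed training data: 7 successes in 10 trials
n_eval = 10                  # size of a new batch of trials to predict

# Draw theta ~ p(theta | y_train); with a Beta(2, 2) prior this posterior
# is Beta(2 + k, 2 + n - k) in closed form.
theta_draws = rng.beta(2 + k_train, 2 + n_train - k_train, size=10_000)

# Draw y_eval ~ p(y_eval | theta) for each posterior draw of theta.
y_eval_draws = rng.binomial(n_eval, theta_draws)

# The empirical distribution of y_eval_draws approximates p(y_eval | y_train).
probs = np.bincount(y_eval_draws, minlength=n_eval + 1) / len(y_eval_draws)
print({k: round(p, 3) for k, p in enumerate(probs)})
```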
Many traditional statistical or machine learning methods proceed by estimating a single “best” value of the parameters using training data, and then predicting evaluation data using that estimate. For example, we could calculate the maximum a posteriori (MAP) estimate of \(\theta\) given the training data, \[ \hat{\theta} = \arg \max_{\theta} p(\theta | y^{train}) , \] and then use it to approximate the distribution of evaluation data, \[ p(y^{eval} | y^{train}) \approx p(y^{eval} | \hat{\theta}) . \] However, this does not incorporate the uncertainty in the estimate of \(\theta\). The full posterior predictive distribution in Equation (1.1) incorporates the uncertainty about \(\theta\) by averaging \(p(y^{eval} | \theta)\) over the posterior distribution \(p(\theta | y^{train})\).
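To see the difference, the following sketch (same assumed Beta-Binomial setup as above) compares the variance of the plug-in predictive \(p(y^{eval} | \hat{\theta})\) with that of the full posterior predictive; the full predictive is wider because it propagates the uncertainty about \(\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)
a_post, b_post = 2 + 7, 2 + 3   # Beta posterior from the assumed training data
n_eval = 10

# Plug-in: condition on the MAP estimate of theta and ignore its uncertainty.
theta_map = (a_post - 1) / (a_post + b_post - 2)
plugin_var = n_eval * theta_map * (1 - theta_map)   # Binomial variance

# Full posterior predictive: average Binomial(n_eval, theta) over the posterior.
theta_draws = rng.beta(a_post, b_post, size=100_000)
y_draws = rng.binomial(n_eval, theta_draws)

print("plug-in predictive variance:        ", round(plugin_var, 2))
print("full posterior predictive variance: ", round(float(y_draws.var()), 2))
```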