Applied Bayesian Data Analysis
Priors

David Tolpin, david.tolpin@gmail.com

Concepts

- Conjugacy
- Informative, non-informative, weakly-informative priors
- Pivotal quantities

Conjugacy

A boring mathematical concept: a prior $p(\theta)$ is conjugate to a likelihood $p(y|\theta)$ if the posterior $p(\theta|y)$ belongs to the same family of distributions as the prior.

Example: binomial model

$$\begin{aligned}
\theta &\sim \mathrm{Prior} \\
y_{1:n} &\sim \mathrm{Bernoulli}(\theta)
\end{aligned}$$

How to choose $\mathrm{Prior}$?
- $p(\theta|y) \propto p(y, \theta) = p(\theta)p(y|\theta)$
- $p(y_{1:n}|\theta) = \theta^k(1 - \theta)^{n-k}$, where $k = \sum_{i=1}^n y_i$ is the number of successes
- if $p(\theta) \propto \theta^a(1-\theta)^b$, then $p(\theta|y) \propto \theta^{a + k}(1 - \theta)^{b + n - k}$ — same form

Example: binomial model

- $\mathrm{Beta}(\theta|\alpha, \beta) = \frac 1 {\mathrm{B}(\alpha, \beta)} \theta^{\alpha-1} (1 - \theta)^{\beta - 1}$
- $\mathrm{Beta}(\alpha, \beta)$ is the conjugate prior for $\mathrm{Bernoulli}(\theta)$
- $\alpha$ — number of ‘prior’ successes (heads)
- $\beta$ — number of ‘prior’ failures (tails)
- $\alpha=\beta=1$ — uniform $[0, 1]$ prior
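
A minimal sketch of the conjugate update in Python (the flips are made-up data for illustration):

    import numpy as np
    from scipy import stats

    alpha, beta = 1.0, 1.0           # Beta(1, 1): uniform prior
    y = np.array([1, 0, 1, 1, 0])    # hypothetical coin flips, 1 = heads
    k, n = y.sum(), len(y)

    # conjugate update: posterior is Beta(alpha + k, beta + n - k)
    posterior = stats.beta(alpha + k, beta + n - k)
    print(posterior.mean(), posterior.interval(0.9))

No integration is needed: conjugacy turns posterior inference into parameter arithmetic.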

Exponential families

- $\mathcal{F}$ is an exponential family if
  $$p(y_i|\theta) = f(y_i)g(\theta)e^{\phi(\theta)^\top u(y_i)}$$
- $\phi(\theta)$ — natural parameter
- likelihood of set $y=(y_1, ..., y_n)$ is
  $$p(y|\theta) \propto g(\theta)^n e^{\phi(\theta)^\top t(y)}$$
  where $t(y) = \sum_{i=1}^n u(y_i)$
- $t(y)$ is a sufficient statistic for $\theta$: all we need to know about the data
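
For instance, the Bernoulli likelihood can be written in this form:
$$p(y_i|\theta) = \theta^{y_i}(1-\theta)^{1-y_i} = (1-\theta)\,e^{y_i \log\frac{\theta}{1-\theta}},$$
so $f(y_i) = 1$, $g(\theta) = 1-\theta$, $\phi(\theta) = \log\frac{\theta}{1-\theta}$, $u(y_i) = y_i$, and $t(y) = \sum_{i=1}^n y_i = k$, the number of successes.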

Exp. family conjugates

- If $p(\theta) \propto g(\theta)^\eta e^{\phi(\theta)^\top\nu}$,
- then $p(\theta|y) \propto g(\theta)^{\eta+n}e^{\phi(\theta)^\top(\nu+t(y))}$.
- $p(\theta|y)$ has the same form, so $p(\theta)$ is conjugate to $p(y|\theta)$.
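
Instantiating this for the Bernoulli form above recovers the Beta prior:
$$p(\theta) \propto (1-\theta)^\eta e^{\nu\log\frac{\theta}{1-\theta}} = \theta^\nu(1-\theta)^{\eta-\nu},$$
i.e. $\mathrm{Beta}(\nu+1, \eta-\nu+1)$; the update $(\eta,\nu) \to (\eta+n, \nu+t(y))$ is exactly the update $(\alpha, \beta) \to (\alpha+k, \beta+n-k)$ seen earlier.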

Exp. family members

- Bernoulli, $\propto \theta^y(1-\theta)^{1-y}$
- Normal, $\propto \frac 1 \sigma e^{-\frac 1 {2\sigma^2} {(y-\mu)^2}}$
- Poisson, $\propto \theta^y e^{-\theta}$
- Exponential, $\propto \theta e^{-y\theta}$
- ...
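
Each member comes with a conjugate prior of the general form above; for Poisson, $g(\theta) = e^{-\theta}$ and $\phi(\theta) = \log\theta$ give a Gamma prior. A sketch checking the conjugate Gamma-Poisson update against a brute-force grid posterior (counts and prior parameters are arbitrary):

    import numpy as np
    from scipy import stats

    a, b = 2.0, 1.0                  # Gamma(shape a, rate b) prior on theta
    y = np.array([3, 1, 4, 2, 2])    # synthetic Poisson counts
    n, t = len(y), y.sum()

    # conjugate update: posterior is Gamma(a + t, b + n)
    post = stats.gamma(a + t, scale=1/(b + n))

    # brute-force grid posterior for comparison
    theta = np.linspace(1e-3, 10, 2000)
    unnorm = stats.gamma.pdf(theta, a, scale=1/b) \
             * stats.poisson.pmf(y[:, None], theta).prod(axis=0)
    grid_mean = (theta * unnorm).sum() / unnorm.sum()

    print(post.mean(), grid_mean)    # agree up to grid error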

Specifying priors

- Prior $p(\theta) = \int_Y p(\theta|y)p(y)\,dy$ is the marginal of $\theta$ over all possible observations.
- Posterior is a compromise between the prior and the conditional $p(y|\theta)$:
  - $\mathbb{E}(\theta) = \mathbb{E}(\mathbb{E}(\theta|y))$
  - $\mathrm{var}(\theta) = \mathbb{E}(\mathrm{var}(\theta|y)) + \mathrm{var}(\mathbb{E}(\theta|y))$
    - $\mathbb{E}(\mathrm{var}(\theta|y))$ — ‘unexplained’ variation
    - $\mathrm{var}(\mathbb{E}(\theta|y))$ — ‘explained’ variation
- Posterior variance is on average smaller than prior variance.
- If the posterior variance is greater, look for a problem.
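
A quick simulation of this decomposition for the Beta-Bernoulli model (prior parameters and sample size are arbitrary):

    import numpy as np
    rng = np.random.default_rng(0)

    alpha, beta, n, S = 2.0, 2.0, 10, 100_000

    # draw (theta, k) from the joint: theta ~ Beta, k|theta ~ Binomial(n, theta)
    theta = rng.beta(alpha, beta, S)
    k = rng.binomial(n, theta)

    # posterior for each simulated dataset is Beta(alpha + k, beta + n - k)
    a, b = alpha + k, beta + n - k
    post_mean = a / (a + b)
    post_var = a*b / ((a + b)**2 * (a + b + 1))

    prior_var = alpha*beta / ((alpha + beta)**2 * (alpha + beta + 1))
    print(prior_var, post_var.mean() + post_mean.var())  # two sides of the identity
    print((post_var < prior_var).mean())                  # fraction of datasets with smaller posterior variance

The two numbers on the first line agree up to Monte Carlo error.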

Informative priors

- Prior defines the ‘population’
- Or, prior defines the ‘state of knowledge’
- Example: coin flip
  - 9+1 coins from the same batch
  - each of the first 9 coins was tossed once: 5 fell on heads, 4 on tails
  - prior for the 10th coin: $\mathrm{Beta}(5, 4)$

Non-informative priors

- No prior information, (almost) all distributions are possible
- May also be used for ‘regularization’, that is, for making the model work
- Examples:
  - $\mathrm{Beta}(1, 1)$ — uniform prior
  - $\mathrm{Normal}(0, 1000)$ — regularization

Non-informative priors

Pivotal quantity, location, scale

- Location:
  - $p(y-\theta|\theta) = f(u)$, $u = y - \theta$
  - $y - \theta$ — pivotal quantity, $\theta$ — location parameter
  - $p(\theta) \propto C$
- Scale:
  - $p(\frac y \theta|\theta) = f(u)$, $u = \frac y \theta$
  - $\frac y \theta$ — pivotal quantity, $\theta$ — scale parameter
  - $p(\log \theta) \propto C$, $p(\theta) \propto \frac 1 \theta$
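
The last step is a change of variables: a flat density on $\log\theta$ transforms as
$$p(\theta) = p(\log\theta)\left|\frac{d\log\theta}{d\theta}\right| \propto \frac 1 \theta.$$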
 

Weakly-informative priors

- Some information
- Less information than in the data
- Examples:
  - Fraction of the Earth covered by water: $\mathrm{Uniform}(0.5, 1)$
  - Salary: $\mathrm{Exponential}$ with mean $11\,500\,₪$