Applied Bayesian Data Analysis
Priors
David Tolpin, david.tolpin@gmail.com
Concepts
- Conjugacy
- Informative, non-informative, semi-informative priors
- Pivotal quantities
Conjugacy
A boring mathematical concept:
Example: binomial model
\begin{aligned}
\theta &\sim \mathrm{Prior} \\
y_{1:n}&\sim \mathrm{Bernoulli}(\theta)
\end{aligned}
How to choose $\mathrm{Prior}$?
- $p(\theta|y) \propto p(y, \theta) = p(\theta)p(y|\theta)$
- $p(y_{1:n}|\theta) = \theta^k(1 - \theta)^{n-k}$, where $k$ is the number of successes
- if $p(\theta) \propto \theta^a(1-\theta)^b$
then $p(\theta|y) \propto \theta^{a + k}(1 - \theta)^{b + n - k}$ — same form
Example: binomial model
- $\mathrm{Beta}(\theta|\alpha, \beta) = \frac 1 {\mathrm{B}(\alpha, \beta)} \theta^{\alpha-1} (1 - \theta)^{\beta - 1}$
- $\mathrm{Beta}(\alpha, \beta)$ is the conjugate prior for $\mathrm{Bernoulli}(\theta)$
- $\alpha$ — number of ‘prior’ successes (heads),
- $\beta$ — number of ‘prior’ failures (tails).
- $\alpha=\beta=1$ — uniform $[0, 1]$ prior.
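A minimal sketch of this update in Python (the counts $n$ and $k$ are made up for the example; SciPy's `beta` supplies the posterior summaries):

```python
from scipy import stats

# Hypothetical data: n tosses, k heads (made-up numbers for illustration)
n, k = 20, 13

# Uniform prior Beta(1, 1); conjugate update gives Beta(1 + k, 1 + n - k)
alpha, beta = 1, 1
posterior = stats.beta(alpha + k, beta + n - k)

print(posterior.mean())          # posterior mean of theta
print(posterior.interval(0.95))  # central 95% credible interval
```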
Exponential families
- $\mathcal{F}$ is an exponential family if
$$p(y_i|\theta) = f(y_i)g(\theta)e^{\phi(\theta)^\top u(y_i)}$$
- $\phi(\theta)$ — natural parameter
- likelihood of set $y=(y_1, ..., y_n)$ is
$$p(y|\theta) \propto g(\theta)^n e^{\phi(\theta)^\top t(y)}$$
where $t(y) = \sum_{i=1}^n u(y_i)$
- $t(y)$ is a sufficient statistic for $\theta$: all we need to know about the data
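As a check, the Bernoulli likelihood fits this form:
$$p(y_i|\theta) = \theta^{y_i}(1-\theta)^{1-y_i} = (1-\theta)\,e^{y_i\log\frac{\theta}{1-\theta}},$$
so $f(y_i)=1$, $g(\theta)=1-\theta$, $\phi(\theta)=\log\frac{\theta}{1-\theta}$, $u(y_i)=y_i$, and $t(y)=\sum_{i=1}^n y_i = k$, the number of successes.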
Exp. family conjugates
- If $p(\theta) \propto g(\theta)^\eta e^{\phi(\theta)^\top\nu}$,
- then $p(\theta|y) \propto g(\theta)^{\eta+n}e^{\phi(\theta)^\top(\nu+t(y))}$.
- $p(\theta|y)$ has the same form, so $p(\theta)$ is conjugate to $p(y|\theta)$.
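For the Bernoulli, with $g(\theta)=1-\theta$ and $\phi(\theta)=\log\frac{\theta}{1-\theta}$, this recipe recovers the Beta prior:
$$p(\theta) \propto (1-\theta)^\eta e^{\nu\log\frac{\theta}{1-\theta}} = \theta^\nu(1-\theta)^{\eta-\nu},$$
that is, $\mathrm{Beta}(\nu+1, \eta-\nu+1)$, consistent with the binomial example.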
Exp. family members
- Bernoulli
- Normal, $\propto \frac 1 \sigma e^{-\frac 1 {2\sigma^2} {(y-\mu)^2}}$
- Poisson, $\propto \theta^y e^{-\theta}$
- Exponential, $\propto \theta e^{-y\theta}$
- ...
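Each member has a conjugate prior of the form above; for the Poisson, it is the Gamma distribution. A minimal sketch (the counts are made up for the example):

```python
import numpy as np
from scipy import stats

# Hypothetical Poisson counts (made up for illustration)
y = np.array([3, 0, 2, 4, 1])

# Gamma(alpha, rate beta) prior; SciPy uses shape and scale = 1/rate
alpha, beta = 2.0, 1.0

# Conjugate update: Gamma(alpha + sum(y), beta + n)
posterior = stats.gamma(a=alpha + y.sum(), scale=1 / (beta + len(y)))

print(posterior.mean())  # posterior mean of the Poisson rate theta
```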
Specifying priors
- Prior $p(\theta) = \int_Y p(\theta|y)p(y)\,dy$ is the marginal of $\theta$ over all possible observations.
- Posterior is a compromise between the prior and the likelihood:
- $\mathbb{E}(\theta) = \mathbb{E}(\mathbb{E}(\theta|y))$
- $\mathrm{var}(\theta) = \mathbb{E}(\mathrm{var}(\theta|y)) + \mathrm{var}(\mathbb{E}(\theta|y))$
- $\mathbb{E}(\mathrm{var}(\theta|y))$ — ‘unexplained’ variation
- $\mathrm{var}(\mathbb{E}(\theta|y))$ — ‘explained’ variation
- Posterior variance is on average smaller than prior variance
- If the posterior variance is greater, look for a problem
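These identities are easy to check by simulation; a sketch for the Beta-Bernoulli model above (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, draws = 10, 100_000  # arbitrary sizes for the check

# Simulate the joint: theta ~ Beta(1, 1), k ~ Binomial(n, theta)
theta = rng.beta(1, 1, size=draws)
k = rng.binomial(n, theta)

# Posterior Beta(1 + k, 1 + n - k): mean and variance in closed form
a, b = 1 + k, 1 + n - k
post_mean = a / (a + b)
post_var = a * b / ((a + b) ** 2 * (a + b + 1))

# Prior variance (1/12) vs E(var(theta|y)) + var(E(theta|y))
print(theta.var(), post_var.mean() + post_mean.var())
```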
Informative priors
- Prior defines the ‘population’
- Or, prior defines the ‘state of knowledge’
- Example: coin flip
- 9+1 coins from the same batch
- the first 9 were each tossed once: 5 fell on heads, 4 on tails
- Prior for the 10th coin: $\mathrm{Beta}(5, 4)$
Non-informative priors
- No prior information: (almost) all parameter values are possible
- May also be used for ‘regularization’, that is, to make the model work
- Examples:
- $\mathrm{Beta}(1, 1)$ — uniform prior
- $\mathrm{Normal}(0, 1000)$ — regularization
Non-informative priors
Pivotal quantity, location, scale
- Location:
- $p(y-\theta|\theta) = f(u)$, $u = y - \theta$
- $y - \theta$ — pivotal quantity, $\theta$ — location parameter
- $p(\theta) \propto C$
- Scale:
- $p(\frac y \theta|\theta) = f(u)$, $u = \frac y \theta$
- $\frac y \theta$ — pivotal quantity, $\theta$ — scale parameter
- $p(\log \theta) \propto C$, $p(\theta) \propto \frac 1 \theta$
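The two statements agree by change of variables: if $p(\log\theta) \propto C$, then
$$p(\theta) = p(\log\theta)\left|\frac{d\log\theta}{d\theta}\right| \propto \frac C \theta.$$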
Weakly-informative priors
- Some information
- Less information than in the data
- Examples:
- Fraction of Earth covered by water: $\mathrm{Uniform}(0.5, 1)$
- Salary: $\mathrm{Exponential}$ with mean $11\,500\,₪$
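A quick sanity check of such priors is to sample from them and eyeball the implied values; a sketch for the salary example (mean $11\,500\,₪$, as above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weakly-informative salary prior: Exponential with mean 11,500
salaries = rng.exponential(scale=11_500, size=10_000)

# The prior should cover plausible salaries without ruling much out
print(np.percentile(salaries, [5, 50, 95]))
```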