Applied Bayesian Data Analysis

Finite mixture models

Concepts:

indicators
identifiability
mixture components
label switching

David Tolpin, david.tolpin@gmail.com

Setting up mixture models

Population consists of subpopulations
Like hierarchical models, only the groups are not given
Random indicators specify the (unknown) subpopulation of each observation
Subpopulations == components

Finite mixtures

We model distribution of $y=(y_1, ..., y_n)$ as a mixture of $H$ components
Component distribution $f_h(y_i|\theta_h)$ depends on parameters $\theta_h$
Proportion of population from $h$th component is $\lambda_h$, $\sum_{h=1}^H\lambda_h=1$

Sampling distribution

The sampling distribution: $$p(y_i|\theta, \lambda) = \lambda_1 f_1(y_i|\theta_1) + \dots + \lambda_H f_H(y_i|\theta_H)$$

LogSumExp: $$\log p(y_i|\theta, \lambda) = \log \left(\lambda_1 \exp(\log f_1(y_i|\theta_1)) + \dots + \lambda_H \exp(\log f_H(y_i|\theta_H))\right)$$
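A minimal sketch of the LogSumExp computation, assuming normal components (the slides leave the component family unspecified). The weights are moved inside the exponent as $\log \lambda_h$, and the largest term is subtracted before exponentiating so the sum does not underflow. The function names are illustrative and not taken from any library.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) at y."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:  # every term is exp(-inf) = 0
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mixture_logpdf(y, lam, mus, sigmas):
    """log p(y | theta, lambda) for a mixture of normal components."""
    terms = [math.log(l) + normal_logpdf(y, m, s)
             for l, m, s in zip(lam, mus, sigmas)]
    return log_sum_exp(terms)

# Assumed example: two components with weights 0.3 and 0.7.
print(mixture_logpdf(1.0, [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]))
# Far in the tail the naive formula takes log(0); the stable one stays finite.
print(mixture_logpdf(100.0, [0.3, 0.7], [0.0, 5.0], [1.0, 1.0]))
```

Working on the log scale matters because the log-likelihood of the whole sample is a sum of such terms, and a single underflow would make it undefined.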

Sampling from sampling distribution

For each $i$:

$z_i \sim Categorical(\lambda)$
$y_{i} \sim F(\theta_{z_i})$
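The two sampling steps above can be sketched as follows, assuming normal components so that $\theta_h = (\mu_h, \sigma_h)$. The function name, the seed, and the example parameter values are illustrative choices, not part of the slides.

```python
import random

def sample_mixture(n, lam, mus, sigmas, seed=0):
    """Draw n observations: first the indicator z_i, then y_i given z_i."""
    rng = random.Random(seed)
    H = len(lam)
    zs, ys = [], []
    for _ in range(n):
        # z_i ~ Categorical(lambda)
        z = rng.choices(range(H), weights=lam)[0]
        # y_i ~ N(mu_z, sigma_z^2), the component selected by z_i
        y = rng.gauss(mus[z], sigmas[z])
        zs.append(z)
        ys.append(y)
    return zs, ys

# Assumed example: two components with weights 0.3 and 0.7.
zs, ys = sample_mixture(5000, [0.3, 0.7], [0.0, 5.0], [1.0, 1.0])
print(sum(zs) / len(zs))  # share of draws from the second component
```

In real data only `ys` is observed; the indicators `zs` are latent and have to be inferred or marginalized out.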

Identifiability of the mixture model

Parameters are not identified if different parameters result in the same likelihood
Mixture models are unidentifiable because labels can be switched (GMM in Stan example)
How to fix:
- specify order of mixture components or mixture weights
- hierarchical mixture models
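Label switching can be checked numerically: swapping the labels of two components leaves the likelihood unchanged. The sketch below assumes a two-component normal mixture with a shared, known standard deviation, and shows the ordering fix as a relabeling that sorts components by mean. The data and parameter values are made up for illustration. In Stan, the same constraint is usually expressed by declaring the means as an `ordered` vector.

```python
import math

def mixture_loglik(ys, lam, mus, sigma=1.0):
    """Log-likelihood of ys under a normal mixture with shared sigma."""
    total = 0.0
    for y in ys:
        terms = [math.log(l) - math.log(sigma) - 0.5 * math.log(2 * math.pi)
                 - 0.5 * ((y - m) / sigma) ** 2
                 for l, m in zip(lam, mus)]
        mx = max(terms)
        total += mx + math.log(sum(math.exp(t - mx) for t in terms))
    return total

def order_components(lam, mus):
    """Relabel components so that the means are increasing."""
    pairs = sorted(zip(mus, lam))
    return [l for _, l in pairs], [m for m, _ in pairs]

ys = [-0.3, 0.4, 4.8, 5.5, 5.1]  # made-up observations
a = mixture_loglik(ys, [0.3, 0.7], [0.0, 5.0])
b = mixture_loglik(ys, [0.7, 0.3], [5.0, 0.0])  # same model, labels swapped
print(a, b)
print(order_components([0.7, 0.3], [5.0, 0.0]))
```

Both labelings map to the same ordered representation, so restricting the parameter space to ordered means removes the duplicate modes.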

Number of mixture components

How many components?

Guess (2 components for heights of humans)
Try different values and compare (Chapter 7)
Infer: $H \sim D$; what should $D$ be?

Philosophy: meaning of mixture models

One opinion: mixture models learn latent true structure
Another opinion: mixture models approximate multi-modal distributions
Both

Example: reaction time in schizophrenia

Dataset http://www.stat.columbia.edu/~gelman/book/data/schiz.asc

Response times measured for 11 non-schizophrenics and 6 schizophrenics
Schizophrenics
- are slower to respond
- sometimes lack attention
⇒ Hierarchical model with mixture for schizophrenics

Reaction time: Data

Reaction time: Parameters

$x_j$ — schizophrenic, $y_{ij}$ — response time
$\lambda$ — probability of delay
$\tau$ — delay
$\alpha$ — response time without delay
$\mu$ — average response time
$\beta$ — slow down in schizophrenics

Reaction time: model

for j in patients:
    $\alpha \sim \mathcal{N}(\mu, \sigma^2_\alpha)$
    if $x_j$:  # schizophrenic
        for i in trials:
            $z \sim \mathrm{Bernoulli}(\lambda)$
            if $z$:  # lack of attention
                $y_{ij} \sim \mathcal{N}(\alpha + \beta + \tau, \sigma^2_y)$
            else:
                $y_{ij} \sim \mathcal{N}(\alpha + \beta, \sigma^2_y)$
    else:
        for i in trials:
            $y_{ij} \sim \mathcal{N}(\alpha, \sigma^2_y)$
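The generative model above can be turned into a small forward simulation. This is only a sketch: all parameter values and the 30 trials per patient are assumptions chosen for illustration, not estimates from the dataset, and the function name is hypothetical.

```python
import random

def simulate_reaction_times(x, n_trials, mu, beta, tau, lam,
                            sigma_alpha, sigma_y, seed=0):
    """Simulate y[j][i] for patients j with schizophrenia indicators x[j]."""
    rng = random.Random(seed)
    data = []
    for x_j in x:
        # Patient-level response time without delay
        alpha = rng.gauss(mu, sigma_alpha)
        row = []
        for _ in range(n_trials):
            if x_j:  # schizophrenic
                z = rng.random() < lam  # z ~ Bernoulli(lambda): lack of attention
                mean = alpha + beta + (tau if z else 0.0)
            else:
                mean = alpha
            row.append(rng.gauss(mean, sigma_y))
        data.append(row)
    return data

# 11 non-schizophrenics followed by 6 schizophrenics, as in the example.
x = [0] * 11 + [1] * 6
# Assumed values, for illustration only.
data = simulate_reaction_times(x, n_trials=30, mu=5.7, beta=0.3, tau=0.8,
                               lam=0.12, sigma_alpha=0.1, sigma_y=0.2)
print(len(data), len(data[0]))
```

Simulating from the model before fitting it is a quick check that the roles of $\beta$ (shift for all schizophrenic trials) and $\tau$ (extra delay on inattentive trials only) are coded as intended.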

Readings

Bayesian Data Analysis (http://www.stat.columbia.edu/~gelman/book/) — chapter 22: Discrete mixture models

Statistical rethinking — chapter 12: Monsters and mixtures