Problem
Given a statistical model \(P(\boldsymbol{X}, \boldsymbol{Z} | \boldsymbol{\theta})\) which generates a set of observations \(\boldsymbol{X}\), where \(\boldsymbol{Z}\) is a latent variable and \(\boldsymbol{\theta}\) is an unknown parameter vector, the goal is to find the \(\boldsymbol{\theta}\) that maximizes the marginal likelihood:
$$ \mathcal{L}(\boldsymbol{\theta}; \boldsymbol{X}) = P(\boldsymbol{X} | \boldsymbol{\theta}) = \int_{\boldsymbol{Z}}P(\boldsymbol{X}, \boldsymbol{Z} | \boldsymbol{\theta})d\boldsymbol{Z} $$
As an example of this type of problem, consider two (unfair) coins A and B, with head probabilities \(p_A(H) = p \text{ and } p_B(H) = q\). For each trial, we select coin A with probability \(p(A) = \tau\) and coin B with probability \(p(B) = 1 -\tau\), toss the selected coin, and record the outcome. The set of observations \(\boldsymbol{X}\) is the record of heads or tails \(\{H, T, H, H, \cdots\}\), the latent (unobserved) variable is which coin was selected for each trial \(\{A, B, B, A, \cdots\}\), and the unknown parameter vector is \(\boldsymbol{\theta} = [p, q, \tau]\). The goal is to find the \(\boldsymbol{\theta}\) that best fits the observations; EM is an instance of Maximum Likelihood Estimation (MLE).
The EM algorithm
The algorithm
The EM algorithm seeks \(\boldsymbol{\theta}\) by first initializing a random parameter vector \(\boldsymbol{\theta}^{(0)}\) and then iteratively performing two steps, namely the expectation step (E step) and the maximization step (M step); a code sketch of the overall loop follows the two steps below:
- (The E step) compute the expected log-likelihood of \(\boldsymbol{\theta}\) with respect to the conditional distribution of \(\boldsymbol{Z}\) given the observations \(\boldsymbol{X}\) and the current estimate \(\boldsymbol{\theta}^{(t)}\)
$$ Q(\boldsymbol{\theta} | \boldsymbol{\theta}^{(t)}) = \mathbb{E}_{\boldsymbol{Z} \sim P(. | \boldsymbol{X}, \boldsymbol{\theta}^{(t)})} {[ \log P(\boldsymbol{X}, \boldsymbol{Z} | \boldsymbol{\theta}) ]} $$
- (The M step) update the parameter vector \(\boldsymbol{\theta}\)
$$ \boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta} | \boldsymbol{\theta}^{(t)}) $$
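As a concrete (if skeletal) illustration, the overall loop might be organized as below. This is only a sketch in Python: `e_step` and `m_step` are hypothetical model-specific callbacks (they are not from any library), and the stopping rule on the parameter change is just one common choice.

```python
import numpy as np

def expectation_maximization(X, theta0, e_step, m_step, tol=1e-8, max_iter=1000):
    """Generic EM loop: alternate E and M steps until the parameters stop moving.

    e_step(X, theta) should return the posterior quantities needed to define
    Q(. | theta_t); m_step(X, stats) should return argmax_theta Q(theta | theta_t).
    Both are model-specific and supplied by the caller.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(X, theta)                      # E step: posteriors under current theta
        theta_new = m_step(X, stats)                  # M step: maximize Q(theta | theta_t)
        if np.max(np.abs(theta_new - theta)) < tol:   # simple convergence check
            return theta_new
        theta = theta_new
    return theta
```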
Proof of correctness
Setup
We need to prove that updating the parameter vector \(\boldsymbol{\theta}\) with the EM algorithm monotonically increases the marginal likelihood \(P(X|\theta)\):
$$ \log P(X|\theta^*) - \log P(X|\theta^{(t)}) \geq Q(\theta^* | \theta^{(t)}) - Q(\theta^{(t)} | \theta^{(t)}) \geq 0 $$ where the second inequality follows from \(\theta^* = \arg\max_\theta Q(\theta | \theta^{(t)}) \).
Proof
$$ \begin{aligned} & P(X, Z | \theta) = P(Z | X, \theta) P(X | \theta) & \text{\tiny(Bayes' theorem)} \\ \iff & \log P(X, Z | \theta) = \log P(Z |X, \theta) + \log P(X |\theta) & \\ \iff & \log P(X | \theta) = \log P(X, Z | \theta) - \log P (Z | X,\theta) & \\ \implies & \log P(X | \theta) = \underbrace{ \mathbb{E}_{Z|X,\theta^{(t)}}{[\log P(X, Z | \theta)]} }_{Q(\theta | \theta^{(t)})} + \underbrace{ - \mathbb{E}_{Z|X,\theta^{(t)}}{[\log P(Z|X,\theta)]} }_{H(\theta | \theta^{(t)})} & \text{\tiny(taking the expectation of both sides)} \end{aligned} $$
Now consider \(\log P(X|\theta) - \log P(X|\theta^{(t)})\), the change in the log-likelihood of the observed data when we update the parameter vector \(\theta\):
$$ \begin{aligned} \log P(X|\theta) - \log P(X|\theta^{(t)}) & = Q(\theta|\theta^{(t)}) - Q(\theta^{(t)}|\theta^{(t)}) \\ & + \underbrace{H(\theta|\theta^{(t)}) - H(\theta^{(t)}|\theta^{(t)})}_{A} \\ \end{aligned} $$
Quantity \(A\)
$$ \begin{aligned} A & = - \mathbb{E}_{Z|X,\theta^{(t)}}{[\log P(Z|X,\theta)]} - \big( -\mathbb{E}_{Z|X,\theta^{(t)}}{[\log P(Z|X,\theta^{(t)})]} \big) \\ & = \mathbb{E}_{Z|X,\theta^{(t)}}{\bigg[ \log P(Z|X,\theta^{(t)}) - \log P(Z|X,\theta) \bigg]} \\ & = \int_{Z} P(Z|X,\theta^{(t)}) \log{\frac{P(Z|X,\theta^{(t)})}{P(Z|X,\theta)} dZ}\\ & \geq 0 &\text{\tiny(Gibbs' inequality)} \end{aligned} $$
So that
$$ \begin{aligned} \log P(X|\theta) - \log P(X|\theta^{(t)}) & = Q(\theta|\theta^{(t)}) - Q(\theta^{(t)}|\theta^{(t)}) + A \\ & \geq Q(\theta|\theta^{(t)}) - Q(\theta^{(t)}|\theta^{(t)}) & \square \end{aligned} $$
Examples
EM for the coin example
Setup
The parameter vector is \(\boldsymbol{\theta} = [p, q, \tau]\), and its estimate at step \(t\) is \(\boldsymbol{\theta}^{(t)} = [p_t, q_t, \tau_t]\).
The \(i^{th}\) observation \(x^{(i)}\) is either head (H) or tail (T).
The coin selected for the \(i^{th}\) trial, \(z^{(i)}\), is either A or B:
- \(p(z^{(i)} = A) = \tau\)
- \(p(z^{(i)} = B) = 1 -\tau\).
For both cases, $$ \begin{equation} p(z^{(i)}) = \tau^{\mathbb{I}(z^{(i)}=A)}(1-\tau)^{\mathbb{I}(z^{(i)}=B)} \end{equation} $$
When coin A is selected,
- Probability that we get a head (H): \(p(x^{(i)}=H | z^{(i)} = A) = p\)
- Probability that we get a tail (T): \(p(x^{(i)}=T | z^{(i)} = A) = 1 - p\)
For both cases, $$ \begin{equation} p(x^{(i)} | z^{(i)}=A) = p^{\mathbb{I}(x^{(i)}=H)}(1 - p)^{\mathbb{I}(x^{(i)}=T)} \end{equation} $$
Similarly, when B is selected $$ \begin{equation} p(x^{(i)} | z^{(i)}=B) = q^{\mathbb{I}(x^{(i)}=H)}(1 - q)^{\mathbb{I}(x^{(i)}=T)} \end{equation} $$
Where \(\mathbb{I}(\cdot)\) is an indicator function on a predicate $$ \mathbb{I}(p) = \begin{cases} 1 \quad \text{if } p \text{ is True}\\ 0 \quad \text{otherwise} \end{cases} $$
Once again, we generalize for both cases of \(z^{(i)}\)
$$ \begin{equation} \begin{aligned} p(x^{(i)} | z^{(i)}) = [p^{\mathbb{I}(x^{(i)}=H)}(1 - p)^{\mathbb{I}(x^{(i)}=T)}]^{\mathbb{I}(z^{(i)}=A)}\\ \times [q^{\mathbb{I}(x^{(i)}=H)}(1 - q)^{\mathbb{I}(x^{(i)}=T)}]^{\mathbb{I}(z^{(i)}=B)} \end{aligned} \end{equation} $$
The equation looks rather ugly; we can simplify it by encoding head as 1 and tail as 0, and coin A as 1 and coin B as 0. The equation above can then be written as
$$ \begin{equation} p(x^{(i)} | z^{(i)}) = [p^{x^{(i)}}(1-p)^{1 - x^{(i)}}]^{z^{(i)}} [q^{x^{(i)}}(1-q)^{1 - x^{(i)}}]^{1-z^{(i)}} \end{equation} $$
Similarly for \(p(z^{(i)})\) $$ \begin{equation} p(z^{(i)}) = \tau^{z^{(i)}}(1-\tau)^{1-z^{(i)}} \end{equation} $$
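With this encoding, equations (5) and (6) are just Bernoulli probability mass functions and translate directly into code. A minimal sketch in Python (the function names are my own, not part of any library):

```python
def prob_z(z, tau):
    """Eq. (6): p(z) = tau^z (1 - tau)^(1 - z), with z = 1 for coin A and z = 0 for coin B."""
    return tau ** z * (1 - tau) ** (1 - z)

def prob_x_given_z(x, z, p, q):
    """Eq. (5): Bernoulli(p) if coin A was selected (z = 1), Bernoulli(q) otherwise."""
    return (p ** x * (1 - p) ** (1 - x)) ** z * (q ** x * (1 - q) ** (1 - x)) ** (1 - z)
```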
Applying the EM algorithm
The E step:
- Construct the joint likelihood of a single pair of observation and latent variable, \(p(x^{(i)}, z^{(i)} | \boldsymbol{\theta})\). For conciseness, we drop the \((i)\) superscript from the equations.
$$ \begin{equation} \begin{aligned} p(x, z | \boldsymbol{\theta}) = & p(x | z, \boldsymbol{\theta})p(z | \boldsymbol{\theta})\\ = & [p^{x}(1-p)^{1 - x}]^{z} [q^{x}(1-q)^{1 - x}]^{1-z} \tau^{z}(1-\tau)^{1-z} & \text{\tiny(from eq. 5 and 6)} \end{aligned} \end{equation} $$
Likelihood over the entire set of observations \(\boldsymbol{X}\) and latent variables \(\boldsymbol{Z}\), using the shorthand
$$\boldsymbol{X}\odot\boldsymbol{Z} := \{(x^{(i)}, z^{(i)})\}_{i=1\cdots N}$$
A side note is that I am not entirely sure the \(\odot\) operator is appropriate in this situation.
$$ \begin{equation} \begin{aligned} P(\boldsymbol{X}, \boldsymbol{Z} | \boldsymbol{\theta}) =& \prod_{(x, z) \in \boldsymbol{X}\odot\boldsymbol{Z}} { p(x, z | \boldsymbol{\theta}) } \end{aligned} \end{equation} $$
Log likelihood of the joint probability
$$ \begin{equation} \begin{aligned} \log P(\boldsymbol{X}, \boldsymbol{Z} | \boldsymbol{\theta}) & = \sum_{(x, z)} \log p(x, z | \boldsymbol{\theta}) \end{aligned} \end{equation} $$
Taking a log always seems to make things better.
Finally, we need to take the expectation of the log-likelihood w.r.t. the conditional distribution \(\boldsymbol{Z}|\boldsymbol{X}, \boldsymbol{\theta}^{(t)}\).
Posterior for a single latent \(z\)
$$ \begin{equation} \begin{aligned} p(z | x, \boldsymbol{\theta}^{(t)}) & = \frac{p(x, z | \boldsymbol{\theta}^{(t)})} {p(x | \boldsymbol{\theta}^{(t)})} & \text{\tiny(Bayes Theorem)}\\ & = \frac{p(x, z | \boldsymbol{\theta}^{(t)})} { p(x, z = 0| \boldsymbol{\theta}^{(t)}) + p(x, z = 1| \boldsymbol{\theta}^{(t)}) } & \text{\tiny(Marginal likelihood over z in denominator)}\\ & = \frac{ [p_t^{x}(1-p_t)^{1 - x}]^{z} [q_t^{x}(1-q_t)^{1 - x}]^{1-z} \tau_t^{z}(1-\tau_t)^{1-z} }{ q_t^{x}(1-q_t)^{1 - x} (1-\tau_t) + p_t^{x}(1-p_t)^{1 - x}\tau_t } & \text{\tiny(from eq. 7)} \end{aligned} \end{equation} $$
Taking the expectation
$$ \begin{equation} \begin{aligned} Q(\boldsymbol{\theta} | \boldsymbol{\theta}^{(t)}) &= \mathbb{E}_{\boldsymbol{Z} | \boldsymbol{X}, \boldsymbol{\theta}^{(t)}}{\bigg[ \sum_{(x, z)}{\log p(x, z | \boldsymbol{\theta})} \bigg]} \\ &= \sum_{(x, z)} { \mathbb{E}_{\boldsymbol{Z} | \boldsymbol{X}, \boldsymbol{\theta}^{(t)}}{[ \log p(x, z | \boldsymbol{\theta}) ]} } \\ &= \sum_{(x, z)} { \mathbb{E}_{z | x, \boldsymbol{\theta}^{(t)}}{[ \log p(x, z | \boldsymbol{\theta}) ]} } \\ \end{aligned} \end{equation} $$
It has always bothered me that in the literature, the posterior over the entire set of latent variables \(\boldsymbol{Z} = \{ z^{(1)}, \cdots, z^{(n)}\}\), which the expectation is taken over, is replaced by the posterior for a single latent \(z\) as in (eq. 11) without explanation. So, in order to understand this, consider the following equation.
$$ \begin{aligned} \mathbb{E}_{\boldsymbol{Z}}{\bigg[\sum_{z\in \boldsymbol{Z}}{f(z)}\bigg]} &= \int_{\boldsymbol{Z}}{ \bigg[\sum_{z\in\boldsymbol{Z}} f(z)\bigg] p(\boldsymbol{Z}) d\boldsymbol{Z} } \\ & = \sum_{z\in\boldsymbol{Z}}{\int_{\boldsymbol{Z}}{f(z)}} p(\boldsymbol{Z})d\boldsymbol{Z} \\ & = \sum_{z\in\boldsymbol{Z}}{ \int_{\boldsymbol{Z}\text{/}z} \underbrace{\bigg[\int_{z}f(z)p(z)dz\bigg]}_{A=\mathbb{E}_z[f(z)]} p(\boldsymbol{Z}\text{/}z)d(\boldsymbol{Z}/z) } \\ & = \sum_{z\in\boldsymbol{Z}} A \int_{\boldsymbol{Z}\text{/}z} p(\boldsymbol{Z}\text{/}z)d(\boldsymbol{Z}/z) & \text{\tiny(A is constant w.r.t. the variables being integrated over)} \\ & = \sum_{z\in\boldsymbol{Z}} \mathbb{E}_z[f(z)] & \text{\tiny(integrating a p.d.f. evaluates to 1)} \end{aligned} $$
Here \(\boldsymbol{Z} = \{z^{(i)}\}_{i=1\cdots N}\) with \(z \sim p(z)\), and \(\boldsymbol{Z}/z\) denotes the set of all variables in \(\boldsymbol{Z}\) except \(z\). Note that the factorization \(p(\boldsymbol{Z}) = p(z)\,p(\boldsymbol{Z}/z)\) used in the third line relies on the \(z^{(i)}\) being independent, which holds in our case since the trials are independent given \(\boldsymbol{X}\) and \(\boldsymbol{\theta}^{(t)}\).
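This interchange is also easy to verify numerically for a handful of independent Bernoulli variables; below is a small sketch that enumerates every configuration of \(\boldsymbol{Z}\) exactly (the probabilities and the function \(f\) are arbitrary choices made up for illustration):

```python
import itertools
import numpy as np

# Sanity check: for independent z_i ~ Bernoulli(rho_i),
# E_Z[ sum_i f(z_i) ] equals sum_i E_{z_i}[ f(z_i) ].
rho = np.array([0.2, 0.7, 0.5])          # p(z_i = 1), arbitrary values
f = lambda z: np.log1p(z) + 3.0 * z      # arbitrary per-variable function

# Left-hand side: expectation over the joint distribution of Z by enumeration.
lhs = 0.0
for Z in itertools.product([0, 1], repeat=len(rho)):
    pZ = np.prod([r if z == 1 else 1 - r for z, r in zip(Z, rho)])
    lhs += pZ * sum(f(z) for z in Z)

# Right-hand side: sum of per-variable expectations.
rhs = sum(r * f(1) + (1 - r) * f(0) for r in rho)

assert np.isclose(lhs, rhs)
```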
Having cleared that up, we can resume from (eq. 11) $$ \begin{equation} \begin{aligned} Q(\boldsymbol{\theta} | \boldsymbol{\theta}^{(t)}) &= \sum_{(x, z)} { \mathbb{E}_{z | x, \boldsymbol{\theta}^{(t)}}{[ \log p(x, z | \boldsymbol{\theta}) ]} } \\ &= \sum_{(x, z)} {\bigg[ p(z = 0 | x, \boldsymbol{\theta}^{(t)}) \log p(x, z = 0 | \boldsymbol{\theta}) + p(z = 1 | x, \boldsymbol{\theta}^{(t)}) \log p(x, z = 1 | \boldsymbol{\theta}) \bigg]} \end{aligned} \end{equation} $$
From (eq. 7)
- $$ \begin{aligned} p(x, z = 0 |\boldsymbol{\theta}) = q^x(1-q)^{1-x}(1-\tau) \end{aligned} $$
- $$ \begin{aligned} p(x, z = 1 |\boldsymbol{\theta}) = p^x(1-p)^{1-x}\tau \end{aligned} $$
From (eq. 10)
$$ \begin{aligned} p(z = 0 | x, \boldsymbol{\theta}^{(t)}) & = \frac{ q_t^{x}(1-q_t)^{1-x}(1-\tau_t) }{ q_t^{x}(1-q_t)^{1 - x} (1-\tau_t) + p_t^{x}(1-p_t)^{1 - x}\tau_t } \end{aligned} $$
$$ \begin{aligned} p(z = 1 | x, \boldsymbol{\theta}^{(t)}) & = \frac{ p_t^{x}(1-p_t)^{1-x}\tau_t }{ q_t^{x}(1-q_t)^{1 - x} (1-\tau_t) + p_t^{x}(1-p_t)^{1 - x}\tau_t } \end{aligned} $$
This probability is often referred to as the membership probability. Denote the membership probability of the \(i^{th}\) observation by \(p(z^{(i)} = 0 | x^{(i)}, \boldsymbol{\theta}^{(t)}) = a_i\), so that \(p(z^{(i)} = 1|x^{(i)},\boldsymbol{\theta}^{(t)}) = 1 - a_i\).
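In code, the membership probabilities are a direct transcription of the formula above; a minimal vectorized sketch (function and variable names are my own):

```python
import numpy as np

def membership(x, p_t, q_t, tau_t):
    """a_i = p(z_i = 0 | x_i, theta_t): posterior probability that coin B produced x_i."""
    x = np.asarray(x, dtype=float)                         # observations encoded as 1 (H) / 0 (T)
    lik_b = q_t ** x * (1 - q_t) ** (1 - x) * (1 - tau_t)  # p(x_i, z_i = 0 | theta_t)
    lik_a = p_t ** x * (1 - p_t) ** (1 - x) * tau_t        # p(x_i, z_i = 1 | theta_t)
    return lik_b / (lik_a + lik_b)
```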
Substituting these quantities into eq. 12, we obtain the expectation of the log-likelihood w.r.t. the conditional distribution of \(\boldsymbol{Z}\) given the observations and the current state of the parameters.
$$ \begin{aligned} Q(\boldsymbol{\theta} | \boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{N}\Big\{ a_i [x^{(i)}\log q + (1-x^{(i)})\log(1-q) + \log(1-\tau)] \\ + (1 - a_i)[x^{(i)} \log p + (1-x^{(i)})\log (1-p) + \log\tau] \Big\} \end{aligned} $$
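For reference, this \(Q\) also translates directly into code, which is handy for numerically sanity-checking the closed-form M-step updates derived next (e.g. against a grid search); a sketch, with `a` holding the membership probabilities \(a_i\) from the previous block:

```python
import numpy as np

def expected_loglik(x, a, p, q, tau):
    """Q(theta | theta_t) for the coin model, with a_i = p(z_i = 0 | x_i, theta_t)."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    coin_b = a * (x * np.log(q) + (1 - x) * np.log(1 - q) + np.log(1 - tau))
    coin_a = (1 - a) * (x * np.log(p) + (1 - x) * np.log(1 - p) + np.log(tau))
    return np.sum(coin_b + coin_a)
```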
The M step $$ \boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}}{Q(\boldsymbol{\theta} | \boldsymbol{\theta}^{(t)})} $$
\(\frac{\partial Q}{\partial p} = 0 \)
$$ \begin{aligned} & \frac{\partial Q}{\partial p} = 0 \\ \iff & \sum_{i=1}^N{(1-a_i)[\frac{ x^{(i)}}{p}} - \frac{1-x^{(i)}}{1-p}] = 0 \\ \iff & \frac{1}{p} \underbrace{\sum_{i=1}^N{(1-a_i)x^{(i)}}}_{A} = \frac{1}{1-p} \underbrace{\sum_{i=1}^N{(1-a_i)(1-x^{(i)})}}_{B} \\ \implies & p = A/(A+B) \\ & = \color{red}{\frac{\sum_{i=1}^N{(1-a_i)x^{(i)}}}{\sum_{i=1}^N{(1-a_i)x^{(i)}}+\sum_{i=1}^N{(1-a_i)(1-x^{(i)})}}} \end{aligned} $$
\(\frac{\partial Q}{\partial q} = 0 \), analogous to the case of \(p\)
$$ \begin{aligned} \color{red}{ q = \frac{\sum_{i=1}^N{a_i x^{(i)}}}{\sum_{i=1}^N{a_i x^{(i)}}+\sum_{i=1}^N{a_i(1-x^{(i)})}} } \end{aligned} $$
\(\frac{\partial Q}{\partial \tau} = 0 \) $$ \begin{aligned} & \frac{\partial Q}{\partial \tau} = 0 \\ \iff & \sum_{i=1}^N{\bigg[\frac{-a_i}{1-\tau} + \frac{1-a_i}{\tau}\bigg]} = 0\\ \iff & \frac{1}{1-\tau}\sum_{i=1}^N{a_i} = \frac{1}{\tau}\sum_{i=1}^N{(1-a_i)}\\ \implies & \color{red}{\tau = \frac{\sum_{i=1}^N{(1-a_i)}}{N}} \end{aligned} $$
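Putting the E step (the membership probabilities \(a_i\)) and these closed-form M-step updates together gives the complete EM iteration for the coin example. Below is a minimal sketch on synthetic data; the initial values and the ground-truth parameters used to simulate the data are arbitrary choices of mine. Per the proof above, the printed observed-data log-likelihood should never decrease from one iteration to the next.

```python
import numpy as np

def coin_em(x, p0=0.6, q0=0.4, tau0=0.5, n_iter=20):
    """EM for the two-coin mixture; x is an array of 1 (head) / 0 (tail) observations."""
    x = np.asarray(x, dtype=float)
    p, q, tau = p0, q0, tau0
    for _ in range(n_iter):
        # E step: a_i = p(z_i = 0 | x_i, theta_t), the coin-B membership probability.
        lik_a = p ** x * (1 - p) ** (1 - x) * tau
        lik_b = q ** x * (1 - q) ** (1 - x) * (1 - tau)
        a = lik_b / (lik_a + lik_b)
        # Observed-data log-likelihood under the current parameters (monotonically non-decreasing).
        loglik = np.sum(np.log(lik_a + lik_b))
        # M step: the closed-form updates derived above.
        p = np.sum((1 - a) * x) / np.sum(1 - a)
        q = np.sum(a * x) / np.sum(a)
        tau = np.mean(1 - a)
        print(f"log-likelihood = {loglik:.6f}, p = {p:.3f}, q = {q:.3f}, tau = {tau:.3f}")
    return p, q, tau

# Example usage on synthetic data (hypothetical ground truth p = 0.8, q = 0.3, tau = 0.6).
rng = np.random.default_rng(0)
z = rng.random(1000) < 0.6
x = np.where(z, rng.random(1000) < 0.8, rng.random(1000) < 0.3).astype(float)
coin_em(x)
```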