There are literally hundreds of pages of information on the web about how the expectation-maximization (EM) algorithm works, but in case you haven't had enough, here's another one. The focus here will be on brevity, so I'm going to ignore some technical details.

Let's start with the problem setup. You have a model $p_{\mathbf{y}}(\mathbf{y};\theta)$ and are given observations $\mathbf{y}$ for which you want to compute a maximum likelihood estimate of $\theta$. That is, you want to compute $$ \DeclareMathOperator*{\argmax}{arg\,max} \newcommand{\Expect}{\mathbb{E}} \newcommand{\mb}[1]{\mathbf{#1}} \hat{\theta}_{ML} = \argmax_{\theta} p_{\mb{y}}(\mb{y};\theta) $$ Suppose, however, that for the model you are working with this maximization is intractable for whatever reason. Furthermore, suppose your model contains latent variables $\mb{z}$ such that $p_{\mb{y}}(\mb{y};\theta) = \sum_{\mb{z}} p_{\mb{y},\mb{z}}(\mb{y},\mb{z};\theta)$ (assume we're in a discrete domain for now) and $p_{\mb{y},\mb{z}}(\mb{y},\mb{z};\theta)$ is easy to compute (but there are too many $\mb{z}$'s to make the marginalization practical). EM is an iterative algorithm for maximizing $p_{\mb{y}}(\mb{y};\theta)$ in this scenario.
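
To make the setup concrete, here's a minimal sketch of such a latent-variable model in Python: a two-component, unit-variance Gaussian mixture in which $\mb{z}$ is the unobserved component label and the marginal $p_{\mb{y}}(\mb{y};\theta)$ comes from summing the joint over $\mb{z}$. This is my own toy example (the names `joint` and `marginal` and the parameterization are just for illustration), not anything canonical.

```python
from scipy.stats import norm

# Toy latent-variable model: a two-component, unit-variance Gaussian mixture.
# theta = (pi, mu0, mu1); z in {0, 1} is the unobserved component label.

def joint(y, z, theta):
    """p(y, z; theta) = p(z; theta) * p(y | z; theta) -- easy to evaluate."""
    pi, mu0, mu1 = theta
    prior = pi if z == 1 else 1.0 - pi
    mu = mu1 if z == 1 else mu0
    return prior * norm.pdf(y, loc=mu, scale=1.0)

def marginal(y, theta):
    """p(y; theta) = sum_z p(y, z; theta) -- the quantity we'd like to maximize."""
    return sum(joint(y, z, theta) for z in (0, 1))
```

Of course, with only two values of $\mb{z}$ the marginalization is trivial here; the point is just to fix notation for the sketches that follow.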

The algorithm is simple. Start with some arbitrary $\theta_0$. Then iterate the following two steps until convergence:

E-step: compute $U(\theta, \theta_t) = \Expect_{\mb{z}|\mb{y};\theta_t}[\log p_{\mb{y},\mb{z}}(\mb{y},\mb{z};\theta)]$ as a function of $\theta$, which requires the posterior $p_{\mb{z}|\mb{y}}(\cdot|\mb{y};\theta_t)$.

M-step: set $\theta_{t+1} = \argmax_{\theta} U(\theta, \theta_t)$.
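
To see what these steps can look like in practice, here's a sketch for the toy mixture above, where the E-step reduces to computing posterior "responsibilities" and the M-step has a closed form. The functions `em_step` and `run_em` are my own illustrative names, and this is a sketch rather than a reference implementation.

```python
import numpy as np

def em_step(y, theta):
    """One EM iteration for the toy two-component, unit-variance mixture."""
    pi, mu0, mu1 = theta
    # E-step: responsibilities r_i = p(z_i = 1 | y_i; theta_t), the posterior
    # weights under which the expectation U(theta, theta_t) is taken.
    p1 = pi * np.exp(-0.5 * (y - mu1) ** 2)
    p0 = (1.0 - pi) * np.exp(-0.5 * (y - mu0) ** 2)
    r = p1 / (p0 + p1)
    # M-step: maximize U(theta, theta_t); for this model the argmax is closed form.
    pi_new = r.mean()
    mu0_new = ((1.0 - r) * y).sum() / (1.0 - r).sum()
    mu1_new = (r * y).sum() / r.sum()
    return pi_new, mu0_new, mu1_new

def run_em(y, theta0, n_iters=100, tol=1e-8):
    """Iterate em_step until the parameters stop moving (or n_iters is hit)."""
    theta = theta0
    for _ in range(n_iters):
        new = em_step(y, theta)
        if max(abs(a - b) for a, b in zip(new, theta)) < tol:
            return new
        theta = new
    return theta
```

Note that the E-step here computes the posterior $p_{\mb{z}|\mb{y}}(\cdot|\mb{y};\theta_t)$, which is exactly the distribution the expectations below are taken under.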

So why does this work? We start by observing that $$ p_{\mb{y}}(\mb{y};\theta) p_{\mb{z}|\mb{y}}(\mb{z}|\mb{y};\theta) = p_{\mb{y},\mb{z}}(\mb{y},\mb{z};\theta) $$ or, equivalently, $$ \log p_{\mb{y}}(\mb{y};\theta) = \log p_{\mb{y},\mb{z}}(\mb{y},\mb{z};\theta) - \log p_{\mb{z}|\mb{y}}(\mb{z}|\mb{y};\theta) $$ Taking expectations with respect to $p_{\mb{z}|\mb{y}}(\cdot|\mb{y};\theta')$ for some $\theta'$ and noticing that the left-hand side does not depend on $\mb{z}$, this equality yields $$ \log p_{\mb{y}}(\mb{y};\theta) = \Expect_{\mb{z}|\mb{y};\theta'}[\log p_{\mb{y},\mb{z}}(\mb{y},\mb{z};\theta)] - \Expect_{\mb{z}|\mb{y};\theta'}[\log p_{\mb{z}|\mb{y}}(\mb{z}|\mb{y};\theta)] $$ Define $U(\theta,\theta') = \Expect_{\mb{z}|\mb{y};\theta'}[\log p_{\mb{y},\mb{z}}(\mb{y},\mb{z};\theta)]$ and $V(\theta,\theta') = - \Expect_{\mb{z}|\mb{y};\theta'}[\log p_{\mb{z}|\mb{y}}(\mb{z}|\mb{y};\theta)]$. Substituting back yields $$ \log p_{\mb{y}}(\mb{y};\theta) = U(\theta,\theta') + V(\theta,\theta') $$

By Gibbs' inequality, we have (slightly abusing notation) $$ \begin{align*} V(p_{\theta},p_{\theta'}) &= -\Expect_{p_{\theta'}}[ \log{p_{\theta}} ] \\ &= D(p_{\theta'}||p_{\theta}) + H(p_{\theta'}) \\ &\geq D(p_{\theta'}||p_{\theta'}) + H(p_{\theta'}) \\ &= V(p_{\theta'},p_{\theta'}) \end{align*} $$ where $D(p||q)$ denotes the KL divergence between two distributions $p$ and $q$ and $H(p)$ denotes the entropy of the distribution $p$. From this we conclude $V(\theta,\theta') \geq V(\theta', \theta')$ for every $\theta$.

Therefore, since the M-step guarantees $U(\theta_{t+1},\theta_{t}) \geq U(\theta_{t},\theta_{t})$, we get $$ \log p_{\mb{y}}(\mb{y};\theta_{t+1}) = U(\theta_{t+1},\theta_{t}) + V(\theta_{t+1},\theta_{t}) \geq U(\theta_{t},\theta_{t}) + V(\theta_{t},\theta_{t}) = \log p_{\mb{y}}(\mb{y};\theta_{t}) $$ Showing the strict inequality $\log p_{\mb{y}}(\mb{y};\theta_{t+1}) > \log p_{\mb{y}}(\mb{y};\theta_{t})$ requires some additional assumptions. From this point of view, EM can be seen as repeatedly maximizing a lower bound on $\log p_{\mb{y}}(\mb{y};\theta)$: since $V(\theta,\theta_t) \geq V(\theta_t,\theta_t)$, the function $U(\theta,\theta_t) + V(\theta_t,\theta_t)$ lower-bounds $\log p_{\mb{y}}(\mb{y};\theta)$, is tight at $\theta = \theta_t$, and is maximized over $\theta$ by the M-step.
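
As a quick numerical sanity check of the identity $\log p_{\mb{y}}(\mb{y};\theta) = U(\theta,\theta') + V(\theta,\theta')$, here's a tiny experiment on the toy mixture; it reuses `joint` and `marginal` from the first sketch, and the particular numbers are arbitrary.

```python
import numpy as np

def decomposition_check(y, theta, theta_prime):
    """Verify log p(y; theta) = U(theta, theta') + V(theta, theta') at one point."""
    # p(z | y; theta'), the distribution the expectations are taken under.
    post = np.array([joint(y, z, theta_prime) for z in (0, 1)])
    post = post / post.sum()
    # log p(y, z; theta) and log p(z | y; theta) for both values of z.
    log_joint = np.log([joint(y, z, theta) for z in (0, 1)])
    log_post = log_joint - np.log(marginal(y, theta))
    U = np.dot(post, log_joint)    # E_{z|y;theta'}[log p(y, z; theta)]
    V = -np.dot(post, log_post)    # -E_{z|y;theta'}[log p(z | y; theta)]
    assert np.isclose(U + V, np.log(marginal(y, theta)))

decomposition_check(y=1.3, theta=(0.4, -2.0, 3.0), theta_prime=(0.5, -1.0, 1.0))
```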

And there we have it. Each iteration of EM is guaranteed not to decrease the log likelihood.
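
If you want to watch that guarantee hold numerically, here's a final check on the toy mixture. It reuses `em_step` from the earlier sketch, and the synthetic data and starting point are made up for illustration.

```python
import numpy as np

def log_likelihood(y, theta):
    """log p(y; theta) for the toy two-component, unit-variance mixture."""
    pi, mu0, mu1 = theta
    p = pi * np.exp(-0.5 * (y - mu1) ** 2) + (1.0 - pi) * np.exp(-0.5 * (y - mu0) ** 2)
    return np.sum(np.log(p) - 0.5 * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

theta = (0.5, -1.0, 1.0)    # arbitrary theta_0
prev = -np.inf
for t in range(50):
    theta = em_step(y, theta)       # one E-step + M-step, defined earlier
    ll = log_likelihood(y, theta)
    assert ll >= prev - 1e-9        # monotone, up to floating-point error
    prev = ll
```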