
Expectation Maximization (from here on called EM) is a method for finding parameter estimates in a probabilistic model where some of the random variables (data or latent variables) are unobserved (often called “missing” when referring to data, but “unobserved” or “hidden” when referring to latent variables). It is a frequentist method that computes a maximum likelihood estimate, and maybe that's part of the reason that it wasn't included in this course. It's not that far from the idea of just sampling from unobserved variables, in that you get repeated estimates for the values of the missing data/hidden nodes. I think that EM is most easily understood in the context of k-means clustering, so I will start with that analogy, then generalize the approach.

K-means is an algorithm for clustering some set of data into <math>K</math> clusters. It is a very simple algorithm that implicitly makes the assumption of normally distributed data (so I suppose you could make the implicit assumption explicit and use Bayesian methods instead, but let's not get ahead of ourselves…). The idea is that you start by randomly picking initial centers (or means) for your <math>K</math> clusters, then iteratively recompute where the centers should be. The algorithm is as follows:

<code>
Input: K, the number of clusters to find, and D, a set of data points
Initialize the K cluster means, either completely randomly, or by picking K points from D
Let k_points be the set of points assigned to cluster k, initially empty
Repeat until converged:
  For each point d in D:
    For each cluster mean k in K:
      Calculate the distance between d and k
    Assign d to the k_points set that had the minimum distance between d and k
  For each cluster k_points:
    Assign the cluster mean k to be the mean of the data points in k_points
</code>
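In Python, the loop above might look like the following sketch (pure standard library; the 1-D data set, starting means, and fixed iteration count are made up for illustration, standing in for a real convergence test):

```python
# A minimal k-means sketch for 1-D data. Real implementations check
# convergence (e.g., the means stop moving) instead of a fixed count.
def kmeans(D, means, iters=20):
    for _ in range(iters):
        # Assignment step: put each point in the cluster with the nearest mean
        clusters = [[] for _ in means]
        for d in D:
            k = min(range(len(means)), key=lambda j: abs(d - means[j]))
            clusters[k].append(d)
        # Update step: each mean becomes the average of its assigned points
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means

D = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans(D, means=[0.0, 5.0]))  # settles near the two obvious clusters
```

For multi-dimensional data you would replace `abs(d - means[j])` with a Euclidean distance and average each coordinate separately.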

Determining what “converged” means can be slightly sticky depending on the data, but I won't cover that issue here, as this is just meant for an analogy. The take home point is that K-means is a two step process. The first step is to assign all of the data points to clusters based on some closeness metric (Euclidean distance, in this case). The second step is to recompute the parameter of each cluster (its mean, k) based on the points that were assigned to it. The jump from here to EM is a very small one.

Instead of assigning each data point completely to one cluster, EM computes *expected* assignments of the data to clusters (hence *Expectation* Maximization). To do EM instead of K-means clustering in our example, we really only have to change one part of the algorithm above, the part where we assign data to clusters. Where K-means gave a hard assignment of each data point to each cluster, we calculate the probability of each data point belonging to each cluster, and add each point to each cluster with a weighting for how likely that point is to end up in that cluster. Thus we again have a two step process, in EM commonly called the *Expectation* step (E-step) and the *Maximization* step (M-step).

In the E-step we compute expected assignments of data to clusters based on our current estimates of the model parameters:

<code>
For each point d in D:
  For each cluster mean k in K:
    Calculate the probability p_k that d is assigned to k
  Add d to each k_points set with a weighting equal to p_k
</code>

Computing the probability that d is assigned to k can be done in a few ways. In our simple problem from above, we could compute the distance from the point to each cluster, invert the distances (so that nearer clusters get higher weight), and normalize by their sum, so that the weights sum to one. A perhaps more principled approach is to have a model for <math>p(d|k)</math>; then we can use Bayes' law to compute <math>p(k|d)</math> for each k and normalize the probabilities. If we were assuming normally distributed data, <math>p(d|k)</math> would be a likelihood from a Normal distribution whose mean and variance come from the current model parameter estimates.
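As a sketch of that E-step under the normal model (the means, variances, and prior cluster probabilities below are made-up values for illustration):

```python
import math

def normal_pdf(x, mu, var):
    # Density of a Normal(mu, var) evaluated at x
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def responsibilities(d, means, variances, priors):
    # Bayes' law: p(k|d) is proportional to p(d|k) * p(k)
    weights = [normal_pdf(d, m, v) * p
               for m, v, p in zip(means, variances, priors)]
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the weights sum to one

p = responsibilities(1.1, means=[1.0, 9.0], variances=[1.0, 1.0],
                     priors=[0.5, 0.5])
print(p)  # nearly all of the weight goes to the first cluster
```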

Once we have completed this step, we have the *expected assignments* of all of the data points to clusters. With these expected assignments, we are ready for the M-step.

In the M-step we take the expected assignments of the data and re-estimate the model parameters so as to maximize the likelihood of the expected assignments to each cluster:

<code>
For each cluster k_points:
  Assign the cluster mean k to be the mean of the expected data points in k_points
  (If your model also has a variance, re-estimate that too)
</code>
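A sketch of that update for one cluster, where each data point carries the weight p_k it received in the E-step (the soft assignments below are made up):

```python
def m_step(points):
    # points is a list of (x, weight) pairs; the weights are the p_k values
    total_w = sum(w for _, w in points)
    mean = sum(w * x for x, w in points) / total_w
    var = sum(w * (x - mean) ** 2 for x, w in points) / total_w
    return mean, var

mean, var = m_step([(1.0, 0.9), (1.2, 0.8), (9.0, 0.05)])
print(mean, var)  # the low-weight point at 9.0 barely moves the mean
```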

And then we repeat the process until we have converged. EM is guaranteed not to decrease the likelihood of the data at any step - it is a simple hill climbing approach that can converge to a local optimum. Thus EM is *not* guaranteed to find the maximum likelihood estimate of the parameters given the data, though in practice it often does very well. Wikipedia has a nice example of using EM to estimate the mean and variance of a set of clusters in normally distributed data, similar to our K-means example. This is often called a “Gaussian mixture model.” However, the Wikipedia example will probably make more sense after I generalize the description of the algorithm and introduce the notation they use.
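Putting the two steps together: here is a compact sketch of EM for a two-component 1-D Gaussian mixture (the data and starting values are made up), which records the log likelihood each round so you can see the hill-climbing behavior for yourself:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em(D, mu, var, tau, iters=30):
    lls = []
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point, plus the
        # log likelihood of the data under the current parameters
        r, ll = [], 0.0
        for x in D:
            w0 = tau * normal_pdf(x, mu[0], var[0])
            w1 = (1 - tau) * normal_pdf(x, mu[1], var[1])
            r.append(w0 / (w0 + w1))
            ll += math.log(w0 + w1)
        lls.append(ll)
        # M-step: weighted means, variances, and mixing proportion
        n0 = sum(r)
        n1 = len(D) - n0
        mu = [sum(ri * x for ri, x in zip(r, D)) / n0,
              sum((1 - ri) * x for ri, x in zip(r, D)) / n1]
        var = [sum(ri * (x - mu[0]) ** 2 for ri, x in zip(r, D)) / n0,
               sum((1 - ri) * (x - mu[1]) ** 2 for ri, x in zip(r, D)) / n1]
        tau = n0 / len(D)
    return mu, var, tau, lls

D = [1.0, 1.2, 0.8, 1.1, 9.0, 9.5, 10.1, 9.8]
mu, var, tau, lls = em(D, mu=[0.0, 5.0], var=[1.0, 1.0], tau=0.5)
print(mu, tau)
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))  # never decreases
```

Real code would stop when the change in log likelihood falls below a tolerance rather than running a fixed number of rounds.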

So far we have looked at a specific case of the expectation maximization algorithm. From the way I described it the algorithm may seem like it only is applicable in the context of clustering. That is not the case; clustering just provided an easy introduction to the basic idea of the algorithm. Here I will formalize the general EM algorithm.

What the EM algorithm does is iteratively maximize the expected value of the log likelihood function. This is commonly written down as follows.

The expected log likelihood function is first written down:

<math>Q(\theta|\theta^t) = E_{Z|\vec{x},\theta^t}[\log L(\theta; \vec{x}, Z)]</math>

Here <math>\vec{x}</math> is the data that you have, <math>\theta</math> is some set of parameters (the superscript <math>t</math> denotes a particular estimation of them), <math>Z</math> is your set of unobserved variables, and <math>Q</math> is just a dummy name for the function. Let's break this equation down a little bit.

First, we have <math>\log L(\theta; \vec{x}, Z)</math>. This is just the log likelihood function that we've seen plenty of times throughout the class. It is a function of <math>\theta</math>, <math>\vec{x}</math>, and <math>Z</math>, though <math>\vec{x}</math> and <math>Z</math> are considered fixed, and <math>\theta</math> is the only argument of interest (that's what the semicolon is supposed to represent). We know <math>\vec{x}</math>, because it's a set of observed data, but we don't know <math>Z</math>. But then how is <math>Z</math> fixed? That's where the expectation comes in. Remember that an expectation is really just an integral, and we can see from the subscript on the expectation that <math>Z</math> is the variable of integration. So, for any particular point in the integration, <math>Z</math> is fixed, and our log likelihood function makes sense.
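To make that concrete: when <math>Z</math> is discrete (as in the clustering example, where it holds the cluster assignments), the expectation is just a sum over the possible values of <math>Z</math>, weighted by how probable each value is given the data and the current parameter estimates:

<math>Q(\theta|\theta^t) = \sum_{Z} p(Z|\vec{x},\theta^t) \log L(\theta; \vec{x}, Z)</math>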

So why are we taking an expectation with respect to <math>Z|\vec{x},\theta^t</math>? Because we don't know <math>Z</math>, so we have to average it out somehow. Bayesian methods do some kind of sampling; here we just take an expected value (conditioned on the things that we know, which is the data and our current estimates of the parameters). It turns out that EM can actually give you a probability distribution over <math>Z</math>, just not over <math>\theta</math>, but we don't really need to worry about that here.

Ok, so, computing this equation gives us the log likelihood function of the *parameters* given the data and the previous estimation of the parameters. Doing the calculations necessary for this equation constitutes the E-step mentioned above (if you were to look at the equations on Wikipedia for the Gaussian mixture model, you could convince yourself that this is true - in order to find the log likelihood that's desired, you have to sum over the data). Once we have found <math>Q(\theta|\theta^t)</math>, we can maximize it with respect to <math>\theta</math>:

<math>\theta^{t+1} = \mathrm{argmax}_\theta Q(\theta|\theta^t)</math>

This is the M-step. We are just re-estimating the parameters given the expected log likelihood function that we calculated previously. It hopefully is pretty clear how this relates to the specific M-step described above.

Now that we know the equations, we could go through the example again with the exact math, but Wikipedia does it better than I could, so I just refer you to that page again.

You have a Gaussian Mixture Model with two Gaussians. The variables are distributed as follows:

<math>X_i|Z_i=1\ \sim\ Normal(\mu_1, \Sigma_1)</math>

<math>X_i|Z_i=2\ \sim\ Normal(\mu_2, \Sigma_2)</math>

<math>P(Z_i = 1) = \tau</math>

<math>P(Z_i = 2) = 1 - \tau</math>

All <math>X_i</math> are observed, all <math>Z_i</math> are unobserved. Derive the formula for <math>Q(\theta|\theta^t)</math>. Then derive the parameter values that maximize it, giving you the updated parameters <math>\theta^{t+1}</math>.