This shows you the differences between two versions of the page.

cs-677sp2010:em [2014/12/09 09:54] ryancha created |
cs-677sp2010:em [2014/12/09 09:54] (current) ryancha |
||
---|---|---|---|

Line 7: | Line 7: | ||

===K-means=== | ===K-means=== | ||

- | K-means is an algorithm for clustering some set of data into <math>K</math> clusters. It is a very simple algorithm that implicitly makes the assumption of normally distributed data (so I suppose you could instead make the implicit assumption explicit and use Bayesian methods instead, but let's not get ahead of ourselves...). The idea is that you start by randomly picking initial centers (or means) for your <math>K</math> clusters, then iteratively recompute where the centers should be. The algorithm is as follows: | + | K-means is an algorithm for clustering some set of data into $K$ clusters. It is a very simple algorithm that implicitly makes the assumption of normally distributed data (so I suppose you could instead make the implicit assumption explicit and use Bayesian methods instead, but let's not get ahead of ourselves...). The idea is that you start by randomly picking initial centers (or means) for your $K$ clusters, then iteratively recompute where the centers should be. The algorithm is as follows: |

<pre> | <pre> | ||

Line 39: | Line 39: | ||

</pre> | </pre> | ||

- | Computing the probability that d is assigned to k can be done in a few ways. In our simple problem from above, we can just compute the distance from the point to each cluster and then normalize all of the distances by their sum, so that they all sum to one. A perhaps more principled approach would be to have a model for <math>p(d|k)</math>, then we can use Bayes law to compute <math>p(k|d)</math> for each k, then normalize the probabilities. If we were assuming normally distributed data, <math>p(d|k)</math> would be a likelihood from a Normal distribution with some mean and some variance whose values were from the current model parameter estimates. | + | Computing the probability that d is assigned to k can be done in a few ways. In our simple problem from above, we can just compute the distance from the point to each cluster and then normalize all of the distances by their sum, so that they all sum to one. A perhaps more principled approach would be to have a model for $p(d|k)$, then we can use Bayes law to compute $p(k|d)$ for each k, then normalize the probabilities. If we were assuming normally distributed data, $p(d|k)$ would be a likelihood from a Normal distribution with some mean and some variance whose values were from the current model parameter estimates. |

Once we have completed this step, we have the ''expected assignments'' of all of the data points to clusters. With these expected assignments, we are ready for the M-step. | Once we have completed this step, we have the ''expected assignments'' of all of the data points to clusters. With these expected assignments, we are ready for the M-step. | ||

Line 63: | Line 63: | ||

The expected log likelihood function is first written down: | The expected log likelihood function is first written down: | ||

- | <math>Q(\theta|\theta^t) = E_{Z|\vec{x},\theta^t}[\log L(\theta; \vec{x}, Z)]</math> | + | $Q(\theta|\theta^t) = E_{Z|\vec{x},\theta^t}[\log L(\theta; \vec{x}, Z)]$ |

- | Here <math>\vec{x}</math> is the data that you have, <math>\theta</math> is some set of parameters (the superscript <math>t</math> denotes a particular estimation of them), <math>Z</math> is your set of unobserved variables, and <math>Q</math> is just a dummy name for the function. Let's break this equation down a little bit. | + | Here $\vec{x}$ is the data that you have, $\theta$ is some set of parameters (the superscript $t$ denotes a particular estimation of them), $Z$ is your set of unobserved variables, and $Q$ is just a dummy name for the function. Let's break this equation down a little bit. |

- | First, we have <math>\log L(\theta; \vec{x}, Z)</math>. This is just the log likelihood function that we've seen plenty of times throughout the class. It is a function of <math>\theta</math>, <math>\vec{x}</math>, and <math>Z</math>, though <math>\vec{x}</math> and <math>Z</math> are considered fixed, and <math>\theta</math> is the only argument of interest (that's what the semicolon is supposed to represent). We know <math>\vec{x}</math>, because it's a set of observed data, but we don't know <math>Z</math>. But then how is <math>Z</math> fixed? That's where the expectation comes in. Remember that an expectation is really just an integral, and we can see from the subscript on the expectation that <math>Z</math> is the variable of integration. So, for any particular point in the integration, <math>Z</math> is fixed, and our log likelihood function makes sense. | + | First, we have $\log L(\theta; \vec{x}, Z)$. This is just the log likelihood function that we've seen plenty of times throughout the class. It is a function of $\theta$, $\vec{x}$, and $Z$, though $\vec{x}$ and $Z$ are considered fixed, and $\theta$ is the only argument of interest (that's what the semicolon is supposed to represent). We know $\vec{x}$, because it's a set of observed data, but we don't know $Z$. But then how is $Z$ fixed? That's where the expectation comes in. Remember that an expectation is really just an integral, and we can see from the subscript on the expectation that $Z$ is the variable of integration. So, for any particular point in the integration, $Z$ is fixed, and our log likelihood function makes sense. |

- | So why are we taking an expectation with respect to <math>Z|\vec{x},\theta^t</math>? Because we don't know <math>Z</math>, so we have to average it out somehow. Bayesian methods do some kind of sampling; here we just take an expected value (conditioned on the things that we know, which is the data and our current estimates of the parameters). It turns out that EM can actually give you a probability distribution over <math>Z</math>, just not over <math>\theta</math>, but we don't really need to worry about that here. | + | So why are we taking an expectation with respect to $Z|\vec{x},\theta^t$? Because we don't know $Z$, so we have to average it out somehow. Bayesian methods do some kind of sampling; here we just take an expected value (conditioned on the things that we know, which is the data and our current estimates of the parameters). It turns out that EM can actually give you a probability distribution over $Z$, just not over $\theta$, but we don't really need to worry about that here. |

- | Ok, so, computing this equation gives us the log likelihood function of the ''parameters'' given the data and the previous estimation of the parameters. Doing the calculations necessary for this equation constitutes the E-step mentioned above (if you were to look at the equations on Wikipedia for the Gaussian mixture model, you could convince yourself that this is true - in order to find the log likelihood that's desired, you have to sum over the data). Once we have found <math>Q(\theta|\theta^t)</math>, we can maximize it with respect to <math>\theta</math>: | + | Ok, so, computing this equation gives us the log likelihood function of the ''parameters'' given the data and the previous estimation of the parameters. Doing the calculations necessary for this equation constitutes the E-step mentioned above (if you were to look at the equations on Wikipedia for the Gaussian mixture model, you could convince yourself that this is true - in order to find the log likelihood that's desired, you have to sum over the data). Once we have found $Q(\theta|\theta^t)$, we can maximize it with respect to $\theta$: |

- | <math>\theta^{t+1} = \mathrm{argmax}_\theta Q(\theta|\theta^t)</math> | + | $\theta^{t+1} = \mathrm{argmax}_\theta Q(\theta|\theta^t)$ |

This is the M-step. We are just re-estimating the parameters given the expected log likelihood function that we calculated previously. It hopefully is pretty clear how this relates to the specific M-step described above. | This is the M-step. We are just re-estimating the parameters given the expected log likelihood function that we calculated previously. It hopefully is pretty clear how this relates to the specific M-step described above. | ||

Line 83: | Line 83: | ||

You have a Gaussian Mixture Model with two Gaussians. The variables are distributed as follows: | You have a Gaussian Mixture Model with two Gaussians. The variables are distributed as follows: | ||

- | <math>X_i|Z_i=1\ \sim\ Normal(\mu_1, \Sigma_1)</math> | + | $X_i|Z_i=1\ \sim\ Normal(\mu_1, \Sigma_1)$ |

- | <math>X_i|Z_i=2\ \sim\ Normal(\mu_2, \Sigma_2)</math> | + | $X_i|Z_i=2\ \sim\ Normal(\mu_2, \Sigma_2)$ |

- | <math>P(Z_i = 1) = \tau</math> | + | $P(Z_i = 1) = \tau$ |

- | <math>P(Z_i = 2) = 1 - \tau</math> | + | $P(Z_i = 2) = 1 - \tau$ |

- | All <math>X_i</math> are observed, all <math>Z_i</math> are unobserved. Derive the formula <math>Q(\theta|\theta^t)</math>. Then derive the maximum likelihood update to that formula, thus giving you the new parameters <math>\theta^{t+1}</math>. | + | All $X_i$ are observed, all $Z_i$ are unobserved. Derive the formula $Q(\theta|\theta^t)$. Then derive the maximum likelihood update to that formula, thus giving you the new parameters $\theta^{t+1}$. |