This assignment is designed to:

- give hands-on experience with a model-based clustering algorithm
- provide experience with expectation maximization (EM) on a mixture of Gaussians model
- involve you in evaluating cluster quality using quantitative metrics and qualitative inspection

In this project your goal is to build a clustering system for pixels. You will interpret the pixel clusters as segments of color image(s). You will implement EM on a mixture of Gaussians. For this project, you will assume a fixed number of clusters, although you are free to optionally try several cluster counts and determine which works best for your data. For the sake of comparison, you will also make a simple change to your EM algorithm to experiment with CEM (or “hard EM”). You will track and report training set likelihood. For evaluation, you will compute a single extrinsic quantitative metric, the Adjusted Rand Index. Optionally, you can also engage in qualitative evaluation of your clustering results.

The ideas necessary for success on this project have been covered in several lectures in class:

- computing the log likelihood of the data with a mixture of multinomials model is described in Lecture #18
  - you will need to adapt this idea for the mixture of Gaussians
- the CEM algorithm is addressed in Lecture #18
  - you will also need to adapt this idea for the mixture of Gaussians
- methods for initializing the parameters of the model (for EM) are described in Lecture #19
- Gaussian mixture models are described in Lectures #20 and #21

Choose a digital color photograph (640 × 480 is a sufficient size) with interesting color variety and use the pixels from the image as your data points. Also choose a simple synthetic image with clear, uniform color regions (e.g., a BYU logo). Because there is likely no ground truth for the segmentation of your chosen photograph, you will only be able to use an extrinsic metric of clustering quality on your synthetic image.

- Extract features into feature vectors
- For pixel clustering:
- Represent your pixels in a particular color space.
- The choice of color space is up to you.
- You may also wish to include $(x,y)$ pixel coordinates in your feature vector.
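As one possible sketch of this feature-extraction step (the function name, the choice of RGB as the color space, and the `[0, 1]` scaling are our own assumptions, not requirements of the assignment):

```python
import numpy as np

def pixels_to_features(image, include_xy=True):
    """Flatten an (H, W, 3) RGB image into an (H*W, d) feature matrix.

    Each row is one pixel: [r, g, b] scaled to [0, 1], optionally
    followed by (x, y) coordinates also scaled to [0, 1], so that
    position and color are on comparable scales.
    """
    h, w, _ = image.shape
    rgb = image.reshape(-1, 3).astype(float) / 255.0
    if not include_xy:
        return rgb
    ys, xs = np.mgrid[0:h, 0:w]
    xy = np.column_stack([xs.ravel() / max(w - 1, 1),
                          ys.ravel() / max(h - 1, 1)])
    return np.hstack([rgb, xy])
```

If you choose a different color space (e.g., HSV or L*a*b*), only the first three columns change; the optional `(x, y)` columns bias the clusters toward spatially coherent segments.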

- Choose a value of K (number of clusters), and be able to justify your choice.
- For reminders about how to do your arithmetic in the log domain, please see the following page on the NLP wiki: Log-Domain Arithmetic
- Implement the EM algorithm on a mixture of Gaussians model
- Track and report training set likelihood as the algorithm converges.

- Initialize the parameters of your model with a method of your choice.
- For the mixture of Gaussians, one recommended approach is to compute global statistics of the data and perturb the global mean vector to produce a starting mean for each cluster.
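A sketch of that recommended initialization (the perturbation scale, the seed parameter, and the function name are our own choices):

```python
import numpy as np

def init_params(X, k, scale=0.1, seed=0):
    """Initialize a K-component mixture from global data statistics.

    Means are the global mean plus small Gaussian perturbations (scaled
    by the per-dimension standard deviation); variances start at the
    global variance; mixture weights start uniform.
    """
    rng = np.random.default_rng(seed)
    mu, var = X.mean(axis=0), X.var(axis=0) + 1e-6
    means = mu + scale * np.sqrt(var) * rng.standard_normal((k, X.shape[1]))
    variances = np.tile(var, (k, 1))
    pis = np.full(k, 1.0 / k)
    return pis, means, variances
```

Varying the seed across runs gives you the different random initializations needed for the repeated-experiment reporting described below.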

- After you have implemented EM, create a “Classification EM” algorithm by making the necessary changes to your EM implementation. Essentially, the E-step becomes “winner-take-all” instead of computing partial counts.
- Use the same value of K chosen for EM.
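The winner-take-all change can be as small as replacing the soft responsibilities with hard one-hot assignments; a sketch (the function name is ours, and `log_joint` is assumed to be the per-point, per-cluster log joint scores from your E-step):

```python
import numpy as np

def harden(log_joint):
    """Winner-take-all E-step for CEM (Classification EM).

    Instead of soft responsibilities, each point contributes a full
    count of 1 to its single most probable cluster; the M-step can
    then be reused unchanged.
    """
    n, k = log_joint.shape
    resp = np.zeros((n, k))
    resp[np.arange(n), np.argmax(log_joint, axis=1)] = 1.0
    return resp
```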

- Use a confusion matrix as in Assignment 1.0 to count cluster membership versus labeled class membership. A better name for this structure in the context of clustering is a “contingency table”.
- Also implement the clustering metric known as the Adjusted Rand Index (ARI), and evaluate the metric on your EM and CEM results.
- For pixel clustering, apply the contingency table and ARI metric only to your chosen synthetic image. You are comparing cluster ids versus labeled class (color) membership.
- With regard to experimental procedures: since the initializations are to one degree or another random, running an experiment once doesn't give an accurate picture of the algorithm's performance. Run each experiment at least **five times** and report the min, max, mean, median, and standard deviation for each metric, along with the metric values for each run.
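One possible sketch of the contingency table and the ARI formula (Hubert & Arabie's adjusted form); the function names are our own:

```python
import numpy as np
from math import comb

def contingency_table(clusters, classes):
    """Count co-occurrences of cluster ids (rows) and class labels (cols)."""
    ks, cs = np.unique(clusters), np.unique(classes)
    table = np.zeros((len(ks), len(cs)), dtype=int)
    for i, k in enumerate(ks):
        for j, c in enumerate(cs):
            table[i, j] = np.sum((clusters == k) & (classes == c))
    return table

def adjusted_rand_index(clusters, classes):
    """ARI computed from the contingency table: (index - expected) / (max - expected)."""
    t = contingency_table(clusters, classes)
    n = int(t.sum())
    sum_cells = sum(comb(int(x), 2) for x in t.ravel())
    sum_rows = sum(comb(int(x), 2) for x in t.sum(axis=1))
    sum_cols = sum(comb(int(x), 2) for x in t.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)
```

Note that ARI is invariant to relabeling of the clusters, which is exactly what you want when comparing arbitrary cluster ids against color labels.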
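The per-experiment reporting can be collected with a small helper like the following sketch (names are our own):

```python
import numpy as np

def summarize_runs(values):
    """Summary statistics over repeated runs of one experiment.

    `values` holds one metric value per run (at least five runs).
    Returns the individual values plus min/max/mean/median/std.
    """
    v = np.asarray(values, dtype=float)
    return {
        "runs": v.tolist(),
        "min": v.min(), "max": v.max(),
        "mean": v.mean(), "median": np.median(v),
        "std": v.std(ddof=1),   # sample standard deviation
    }
```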

*Please limit the non-code portion of your report to about 5 pages.*

For this assignment, write a clear, well-structured, self-contained report on the work you have done, including the following ingredients:

- **Time**: Include at the top of page 1 of your report a clear measure (in hours) of how long it took you to complete this project. [10 points]
- **Data**: Dedicate one section of your report to specifying your choice of data set. If you chose your own, then explain the source and content of the data. Include your approach to feature selection, and discuss the effect you intend the feature selection to have on the kinds of clusters to be found. [30 points]
- **Design Decisions**: [30 points]
  - Systematically describe what you built and what choices you made.
  - Describe your approach to initialization, to detecting convergence, etc.
  - Also include issues encountered when implementing your model and the EM algorithm. Please do not simply submit a debug or work log.
  - Describe your baseline algorithm(s).
- **Results**: Include the log likelihoods, metrics, confusion matrices (contingency tables), etc., for each method (EM and your baseline(s)). [20 points]
- **Qualitative Evaluation**:
  - Evaluate the quality of your clusters / the quality of the segmented image, then discuss the patterns you observe.
  - Use tables, diagrams, graphs, and interesting examples, where needed, to make your points, and share any other insights you think will be helpful in interpreting your work.
  - Explain the degree to which this is a satisfying clustering of the data.
  - Present the original image augmented with an overlay of color-coded cluster ids.
  - Address the following questions:
    - How are the results surprising?
    - Why do you think your clustering algorithm is working properly?
    - Identify some specific problems that are particularly revealing as to why or how the algorithm makes errors.
    - If you were going to invest a large amount of effort into raising the usefulness of this system, what would you do and why?
- **Feedback**: Include at the end of your report a short section titled “Feedback”. Reflect on your experience in the project and provide any concrete feedback you have for us that would help make this project a better learning experience.
- **Code**: Include your code at the end of your report.
- **Clarity and structure** of your report. [10 points]

Submit your report as a .pdf document via Learning Suite.

Thanks go to Dan Walker, formerly of the BYU NLP Lab, for collecting the Del.icio.us data set.

There are many libraries that make sampling from the Dirichlet and other prior distributions easy. (e.g., http://www.arbylon.net/projects/knowceans-tools/doc/)

Do most of the E step in log space, since the probabilities you'll be dealing with will be very small and may underflow machine precision. The M step can be in normal space since the probabilities will not be nearly as small. Remember that a running sum of probabilities accumulated in the log domain starts at $\log(0) = -\infty$, while a running product of probabilities (a sum of log probabilities) starts at $\log(1) = 0$.
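The standard trick for summing probabilities in the log domain is the "log-sum-exp" shift; a minimal sketch (the function name is ours, and in practice NumPy's built-in `np.logaddexp.reduce` does the same job):

```python
import numpy as np

def log_sum_exp(log_probs):
    """Compute log(sum(exp(log_probs))) without underflow.

    Subtracting the max before exponentiating keeps the exponentials
    in a representable range; the max is added back at the end.
    """
    m = np.max(log_probs)
    if m == -np.inf:          # all inputs are log(0): the sum is 0
        return -np.inf
    return m + np.log(np.sum(np.exp(log_probs - m)))
```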

Track convergence using the log-likelihood of the training set. Use a relative threshold, not an absolute one, since different data sets, different values of K, and/or different feature spaces will result in log-likelihoods of different magnitudes.
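A relative-threshold check can be as simple as the following sketch (the tolerance value and function name are our own choices):

```python
def converged(prev_ll, curr_ll, rel_tol=1e-5):
    """Relative convergence test on the training log-likelihood.

    Stops when the improvement is a tiny fraction of the magnitude of
    the log-likelihood itself, so the threshold does not depend on the
    data-set size or the number of clusters.
    """
    if prev_ll is None:       # no previous iteration yet
        return False
    return abs(curr_ll - prev_ll) <= rel_tol * abs(prev_ll)
```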

It is highly recommended that you start small and build up. Here are some possible steps to take in order to gain confidence that your solution is working:

- Test on a small set of hand-constructed examples (say, around 10 data points) in a low-dimensional feature space.
- If your algorithm is not converging properly, ask a classmate for a first-round code review.
- Test on a small subset of real pixels from your chosen image.
- Then, once your code is working on a small subset, scale up to larger data.