This assignment is designed to:

- give hands-on experience with a model-based clustering algorithm
- provide experience with expectation maximization on a straightforward probabilistic model, the mixture of multinomials model
- involve you in evaluating cluster quality using metrics and (optionally) by qualitative inspection

In this project your goal is to build a clustering system for documents. You will implement EM on a mixture of multinomials. For this project, you will assume a fixed number of clusters, although you are free to try several cluster counts and determine which works best for your data. Furthermore, for the sake of comparison, you will make a simple change to your EM algorithm to experiment with CEM (or “hard EM”). You will track and report training-set likelihood. For evaluation, you will compute a single extrinsic quantitative metric, the Adjusted Rand Index (ARI). Optionally, you can also engage in qualitative evaluation of your clustering results.

The ideas necessary for success on this project have been covered in several lectures in class:

- the mixture of multinomials model is described in Lecture #17
- computing the log likelihood of the data with the model is described in Lecture #18
- the EM algorithm on the mixture of multinomials model is described in Lecture #18
- the CEM algorithm is likewise addressed in Lecture #18
- methods for initializing the parameters of the model (for EM) are described in Lecture #19

You will be working with labeled document data; however, pretend that it is unlabeled for the purpose of clustering. The labels will come back later to help with cluster evaluation. You will work with a document collection of your choice. To be clear, for our purposes, documents are collections of words. Available options include:

- the 20 Newsgroups data set
- the email data from Enron
- the del.icio.us web page data set
- or yet another collection of your choice.

Whichever set you choose, it must consist of at least 1000 documents (with labels). (Note that a 1000-document subset of one of the named datasets would also suffice; e.g., take a slice of 50 documents per newsgroup from 20 Newsgroups.)

[5 points]

- Extract features into document feature vectors
- Use selected word types as the evidence. The method of selection is up to you.
- Note that techniques like distributional mutual information (aka information gain) are not allowed, since they would require looking at the labels. Looking at the labels is cheating.
- Even more so than in document classification, features matter in document clustering, because they establish which patterns will be salient to the algorithm. If you find that your clusterer is forcing everything into just a cluster or two, then you probably need to reconsider the diversity of features you are extracting from your documents.
- We recommend top-N per document for topical clustering (see below for details).

[30 points]

- Choose a value of K (number of clusters), and be able to justify your choice.
- For reminders about how to do your arithmetic in the log domain, please see the following page on the NLP wiki: Log-Domain Arithmetic
- The EM algorithm on a mixture of multinomials model
- Track and report training set likelihood as the algorithm converges.
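
To make the E-step/M-step structure concrete, here is a minimal sketch (not a required design) of the full EM loop, assuming documents are stored as a dense count matrix `X` with one row per document; the function name and default values are illustrative, not prescribed:

```python
import numpy as np

def em_multinomial_mixture(X, K, n_iter=100, delta=1.0, rel_tol=1e-6, seed=0):
    """EM for a mixture of multinomials.

    X: (D, V) matrix of word-type counts; K: number of clusters.
    Returns priors pi, word distributions beta, responsibilities gamma,
    and the per-iteration training-set log-likelihoods.
    """
    rng = np.random.default_rng(seed)
    D, V = X.shape
    pi = np.full(K, 1.0 / K)                  # cluster priors p(c)
    beta = rng.dirichlet(np.ones(V), size=K)  # p(w|c); each row sums to 1
    log_liks = []
    for _ in range(n_iter):
        # E-step, in log space: log p(c) + sum_w x_w log p(w|c).
        log_joint = np.log(pi) + X @ np.log(beta).T        # (D, K)
        m = log_joint.max(axis=1, keepdims=True)           # max trick
        log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
        gamma = np.exp(log_joint - log_norm)               # responsibilities
        log_liks.append(float(log_norm.sum()))
        # M-step, in normal space, with add-delta smoothing.
        pi = gamma.sum(axis=0) / D
        counts = gamma.T @ X + delta                       # (K, V)
        beta = counts / counts.sum(axis=1, keepdims=True)
        # Relative convergence check on the training log-likelihood.
        if len(log_liks) > 1 and \
                abs(log_liks[-1] - log_liks[-2]) < rel_tol * abs(log_liks[-2]):
            break
    return pi, beta, gamma, log_liks
```

The sketch uses a random Dirichlet initialization for $p(w|c)$; you are free to substitute any of the initialization methods discussed in class.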

[5 points]

- Initialize the parameters of your model with a method of your choice.
- Smooth the parameters in the M-step as needed, using a prior of your choice.
- We discussed 3 ways to initialize the parameters of the model. Choose one, and justify it.

[5 points]

- After you have implemented EM, create a “Classification EM” algorithm by making the necessary changes to your EM implementation, as discussed in class. Essentially, the E-step becomes “winner-take-all” instead of computing partial counts.
- Use the same value of K chosen for EM.
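
Because the only difference from EM is the E-step, the change can be isolated in a small helper. Here is a sketch, assuming the E-step has already produced a (D, K) matrix of $\log p(c) + \log p(x|c)$ values (the function name is illustrative):

```python
import numpy as np

def hard_e_step(log_joint):
    """CEM ("winner-take-all") E-step: assign each document wholly to its
    most probable cluster instead of computing partial counts.

    log_joint: (D, K) array of log p(c) + log p(x|c).
    Returns a (D, K) one-hot matrix of hard assignments.
    """
    D, K = log_joint.shape
    z = log_joint.argmax(axis=1)        # most probable cluster per document
    gamma = np.zeros((D, K))
    gamma[np.arange(D), z] = 1.0        # one-hot "responsibilities"
    return gamma
```

The M-step then proceeds unchanged, using these one-hot assignments in place of the soft responsibilities.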

[10 points]

- Use a confusion matrix as in Assignment 1.0 to count cluster membership versus labeled class membership. A better name for this structure in the context of clustering is a “contingency table”.
- Also implement the clustering metric known as the Adjusted Rand Index (ARI), and evaluate the metric on your EM and CEM results.
- With regard to experimental procedure, since the initializations are to one degree or another random, running an experiment once doesn't give an accurate picture of the algorithm's performance. Run each experiment at least **five times** and report the min, max, mean, median, and standard deviation for each metric, along with the metric values for each run.
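
ARI can be computed directly from the contingency table using the standard pairwise formulation. A sketch (the function name is ours, not prescribed):

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        table[classes.index(t), clusters.index(p)] += 1
    # Pairs agreeing within each cell, row (class), and column (cluster).
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    n_pairs = comb(int(table.sum()), 2)
    expected = sum_rows * sum_cols / n_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)
```

ARI is 1.0 for a perfect match (even under cluster relabeling), near 0 for a random clustering, and can be negative for a worse-than-random one.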

[up to 20 points]

Choose one of the following approaches (one of the sub-bullets) to evaluating the quality of your clusters. Then discuss the patterns you observe. Use tables, diagrams, graphs, and interesting examples, where needed, to make your points, and share any other insights you think will be helpful in interpreting your work. Explain the degree to which this is a satisfying clustering of the data.

- Present example clusters and subjectively assess the quality of the clusters.
- Show the top predictive word types according to $p(c | w)$ for each cluster.
- Show the top representative word types according to $p(w | c)$ for each cluster.

*Please limit the non-code portion of your report to about 5 pages.*

For this assignment, write a clear, well-structured, self-contained report on the work you have done, including the following ingredients:

**Time**: Include at the top of page 1 of your report a clear measure (in hours) of how long it took you to complete this project.

[5 points]

**Data**: Dedicate one section of your report to specify your choice of data set. If you chose your own, then explain the source and content of the data.

- Include your approach to feature selection, and discuss your intentions for the effect of the feature selector on the kinds of clusters to be found.

[10 points]

**Design Decisions**: Specify what you built and what choices you made. Please do not simply submit a debug or work log.

- Also include issues encountered when implementing EM.

[10 points]

**Results**: Include the log likelihoods, metrics, confusion matrices (contingency tables), etc., for each method.

[10 points]

**Questions**: Briefly address the following questions:

- How are the results surprising?
- Why do you think your clustering algorithm is working properly?
- Identify some specific problems that are particularly revealing as to why or how the algorithm makes errors.
- If you were going to invest a large amount of effort into raising the usefulness of this system, what would you do and why?

**Feedback**: Include at the end of your report a short section titled “Feedback”. Reflect on your experience in the project and provide any concrete feedback you have for us that would help make this project a better learning experience.

**Code**: Include your code at the end of your report.

[10 points]

**Clarity and structure** of your report.

Submit your report as a .pdf document via Learning Suite.

Thanks go to Dan Walker, formerly of the BYU NLP Lab, for collecting the Del.icio.us data set.

There are many libraries that make sampling from the Dirichlet and other prior distributions easy. (e.g., http://www.arbylon.net/projects/knowceans-tools/doc/)
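
If you would rather avoid an extra dependency, NumPy can draw the same samples. For example, a symmetric-Dirichlet initialization of the model parameters might look like this (the cluster count, vocabulary size, and concentration value below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 1000                # illustrative cluster count and vocabulary size
alpha = 1.0                   # symmetric Dirichlet concentration parameter

# One Dirichlet draw per cluster gives a valid p(w|c) row summing to 1.
beta = rng.dirichlet(np.full(V, alpha), size=K)
# A single draw over K gives the cluster priors p(c).
pi = rng.dirichlet(np.full(K, alpha))
```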

Do some smoothing on $p(w|c)$. If you do not, zeros will result in errors and improper distributions. Simple add-1 (or add-delta) should suffice for this project.
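
For instance, add-delta smoothing applied to the expected counts in the M-step can be a two-line helper (the name and array shapes here are illustrative):

```python
import numpy as np

def smooth_beta(expected_counts, delta=1.0):
    """Add-delta smoothing of expected word counts before normalizing,
    so that no p(w|c) entry is exactly zero.

    expected_counts: (K, V) array of expected counts from the E-step.
    Returns a (K, V) array of smoothed p(w|c) rows, each summing to 1.
    """
    counts = expected_counts + delta
    return counts / counts.sum(axis=1, keepdims=True)
```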

Do most of the E step in log space, since the probabilities you'll be dealing with will be very small and may underflow machine precision. The M step can be in normal space since the probabilities will not be nearly as small. Remember that an accumulator for a sum of probabilities in log space starts at $\log(0) = -\infty$, and an accumulator for a product starts at $\log(1) = 0$.
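
A standard way to sum probabilities that live in log space is the "max trick" (log-sum-exp); a minimal version:

```python
import numpy as np

def log_sum_exp(log_vals):
    """Stably compute log(sum_i exp(a_i)) for an array of log values,
    shifting by the max so the exponentials cannot all underflow."""
    m = np.max(log_vals)
    if np.isneginf(m):
        return -np.inf    # sum of zero probabilities is still zero
    return m + np.log(np.sum(np.exp(log_vals - m)))
```

You need exactly this operation in the E-step to normalize each document's per-cluster log joint probabilities.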

Track convergence using the log-likelihood of the training set. Use a relative threshold, not an absolute one, since different data sets, different K's, and/or different vocabularies will result in log-likelihoods of different magnitudes.
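
A relative check might look like the following sketch (the threshold value is illustrative):

```python
def converged(prev_ll, cur_ll, rel_tol=1e-4):
    """Relative convergence check on the training-set log-likelihood:
    stop when the change is a tiny fraction of the current magnitude."""
    if prev_ll is None:       # first iteration: nothing to compare against
        return False
    return abs(cur_ll - prev_ll) / abs(prev_ll) < rel_tol
```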

Top-N per document is a simple method.

- Create an empty feature pool $P$.
- Choose a value $N$, the number of word types to be initially contributed to the pool by each document. Small values (2-5) of $N$ are often sufficient.
- For each document $d$,
- Rank all word types in $d$ by a TF-IDF score, the term frequency inverse document frequency score
- Add the top $N$ word types from $d$ to $P$.

- For each document $d$,
- Remove any word type from $d$ not in $P$.

Note that nearly all documents will end up with at least $N$ features (word types), but most will end up with more, since the pool $P$ will include word types beyond the document’s top $N$. Some pathological documents may end up with fewer features or none at all. Empty documents could be removed prior to clustering.

- The TF-IDF score (“weight”) is defined as follows:
- $weight(i,j) = (1+\log(tf_{i,j}))\cdot \log(M/df_i)$ if $tf_{i,j} \geq 1$
- $weight(i,j) = 0$ if $tf_{i,j} = 0$

- Where:
- $tf_{i,j}$ is defined to be the frequency of word type $i$ in document $j$. It is synonymous with our notation $x_{i,j}$ (the $i$th entry in document $j$'s type count vector).
- $df_i$ is defined to be the number of documents in which word type $i$ occurs.
- $M$ is the total number of documents.
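
The steps above can be sketched as follows, assuming each document is a simple list of tokens (the function name and defaults are illustrative):

```python
import math
from collections import Counter

def top_n_feature_pool(docs, N=3):
    """Top-N-per-document feature selection using the TF-IDF weight above.

    docs: list of documents, each a list of word-type tokens.
    Returns (pruned_docs, pool): the documents restricted to pool words,
    and the feature pool P itself.
    """
    M = len(docs)
    df = Counter()                     # df_i: number of docs containing type i
    for d in docs:
        df.update(set(d))

    def weight(tf, word):
        # (1 + log tf) * log(M / df) when tf >= 1, else 0.
        return (1 + math.log(tf)) * math.log(M / df[word]) if tf >= 1 else 0.0

    pool = set()
    for d in docs:
        tf = Counter(d)
        ranked = sorted(tf, key=lambda w: weight(tf[w], w), reverse=True)
        pool.update(ranked[:N])        # each doc contributes its top N types

    pruned = [[w for w in d if w in pool] for d in docs]
    return pruned, pool
```

Note that a word occurring in every document gets IDF $\log(M/M) = 0$ and so can never enter the pool, which is the desired stop-word-like behavior.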

Regarding using $p(c|w)$ to capture the most significant words for a cluster, imagine there is a word that is very common in every cluster, such as “the”. The word “the” may be the most probable word in each cluster (i.e., have the highest $p(w|c) = \beta_{c,w}$), but it's not particularly interesting because it's equally high for all clusters and doesn't help distinguish among them. By calculating $p(c|w)$, on the other hand, we instead identify the words that are most indicative of each cluster: the words that, when seen, push hardest toward a given cluster. Concretely, in the calculation $p(c|w) = p(w|c)p(c) / \sum_k p(w|k)p(k)$, a word that is highly likely in all classes will be penalized by a large denominator, favoring words that are highly likely in as few clusters as possible.
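
With the model parameters in hand, this calculation is a one-line normalization; a sketch assuming `beta` is a (K, V) array of $p(w|c)$ rows and `pi` a length-K vector of priors:

```python
import numpy as np

def p_cluster_given_word(beta, pi):
    """p(c|w) = p(w|c) p(c) / sum_k p(w|k) p(k), for every word at once.

    beta: (K, V) word distributions; pi: (K,) cluster priors.
    Returns a (K, V) array whose columns each sum to 1.
    """
    joint = beta * pi[:, None]                      # p(w|c) p(c)
    return joint / joint.sum(axis=0, keepdims=True)
```

To show the top predictive word types for cluster $c$, sort the vocabulary by row $c$ of the returned array.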

It is highly recommended that you start small and build up. Here are some possible steps to take in order to gain confidence that your solution is working:

- Test on a small set of hand-constructed examples (say, around 10 examples) on a small vocabulary.
- If your algorithm is not converging properly, ask a classmate for a 1st round code review.
- Test on a small subset of real documents from your chosen corpus.
- Then, once your code is working on a small subset, scale up to larger data.