- 1. Gibbs Sampling on Mixture of Multinomials
- EM alone
- CEM
- k-means (not model based)
- GS
- GS + EM starting from the MAP likelihood sample
- Other things to try:
- send email queries to figure out how to normalize the likelihood and to work out Minimum Bayes Risk decoding, so we can add another summary technique.
- don't plot all metrics on the same scale; differences may be more visible when zoomed in.
- work on the test for significance (related to #2)
- see to what degree these results hold up on the other data sets
- see to what degree these results hold up on smaller splits (approx. 5,000 docs max.), possibly with longer chains
- validate with HBC implementation of Gibbs sampling

Worth doing, perhaps for a subsequent paper: 6. compare with Gibbs + EM; 7. compare with Variational EM using HBC
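For reference, the core of item 1 (Gibbs sampling on a mixture of multinomials) can be sketched as a collapsed Gibbs sampler. This is a minimal sketch only, not the actual implementation: the function name, hyperparameter defaults, and the data layout (documents as word-count `Counter`s over a vocabulary of size `V`) are all our assumptions.

```python
import math
import random
from collections import Counter

def gibbs_mixmulti(docs, K, V, alpha=1.0, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for a mixture of multinomials.
    docs: list of Counter(word_id -> count). Returns one assignment
    vector (a cluster id per doc) for each sweep."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]
    m = [0] * K                          # documents per cluster
    nkw = [Counter() for _ in range(K)]  # word counts per cluster
    nk = [0] * K                         # total tokens per cluster
    for d, doc in enumerate(docs):
        m[z[d]] += 1
        nkw[z[d]].update(doc)
        nk[z[d]] += sum(doc.values())
    samples = []
    for _ in range(iters):
        for d, doc in enumerate(docs):
            nd = sum(doc.values())
            old = z[d]                   # remove doc d from its cluster
            m[old] -= 1
            nkw[old].subtract(doc)
            nk[old] -= nd
            logp = []
            for k in range(K):           # Dirichlet-multinomial predictive
                lp = math.log(m[k] + alpha)
                for w, c in doc.items():
                    for j in range(c):
                        lp += math.log(nkw[k][w] + beta + j)
                for i in range(nd):
                    lp -= math.log(nk[k] + V * beta + i)
                logp.append(lp)
            mx = max(logp)
            probs = [math.exp(lp - mx) for lp in logp]
            r = rng.random() * sum(probs)
            new = 0                      # draw the new assignment
            while new < K - 1 and r > probs[new]:
                r -= probs[new]
                new += 1
            z[d] = new
            m[new] += 1
            nkw[new].update(doc)
            nk[new] += nd
        samples.append(list(z))
    return samples
```

The stored sweeps are exactly what the sample-based summaries elsewhere in these notes (last sample, MAP sample, per-document marginal modes) would be computed over.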

- 2. Metric Survey, Algorithm survey (see below)
- Comparing clustering metrics and algorithms
- Correlations among metrics
- Across many data sets


- 3. Publicizing del.icio.us as new data set
- web data
- human labeled
- plentiful


- 4. Pick best/fastest clustering algorithm
- experiments on desktop PC data
- experiments on “virtual user”
- 1 Enron user's email
- 20 Newsgroups specific to that user's interests in the timeframe of the Enron corpus (harvest Google groups or Usenet corpus)
- Del.icio.us - use user's interests to crawl web data
- IM? PDFs? .doc, .ppt, .xls?

- Reuters Reduced: goal is to assess clustering metrics in comparison with results from the super-reduced set
- XX On Reuters super-reduced: find mode of joint posterior, compare with modes of per-document marginal posteriors
- In one run of 11,000 samples, there were no duplicate samples from the joint posterior.

- Pick k using EM and a held-out data set.
- Implement Rosenberg & Hirschberg's V-measure, based on their EMNLP paper.
- Implement BIC?
- Summarize clusters from Gibbs sampling with the following techniques (use error bars to reflect confidence intervals around metrics):
- modes of per-document marginal posteriors
- last sample
- MAP sample
- random sample
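The three sample-based summaries above reduce to a small helper over a stored chain. A sketch, where `samples` (one assignment vector per sweep) and `log_joint` (the matching joint score per sweep) are assumed names for whatever the sampler records:

```python
from collections import Counter

def summarize_samples(samples, log_joint):
    """Summarize a Gibbs chain three ways: last sample, MAP sample
    (highest stored joint score), and per-document marginal modes."""
    last = samples[-1]
    map_sample = samples[max(range(len(samples)), key=log_joint.__getitem__)]
    n_docs = len(samples[0])
    marginal_modes = [Counter(s[d] for s in samples).most_common(1)[0][0]
                      for d in range(n_docs)]
    return {"last": last, "MAP": map_sample, "marginal_modes": marginal_modes}
```

Note that the marginal modes are only meaningful if cluster labels stay stable across the chain; the label-switching diagnostic below is the check for that.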

- Identifiability / Label-Switching Drift Map
- diagnose using cluster vs. cluster KL-div. heat map
- see Steyvers & Griffiths 2006 Fig. 8 for idea
- try at different lags within the **same** run.
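The heat-map diagnostic might look like the following sketch, assuming cluster-word count tables have been saved at two lags of the same run (all names hypothetical). A near-zero entry off the diagonal, paired with a large entry on it, is the label-switching signature:

```python
import math

def kl(p, q):
    """KL(p || q) for dense probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def smoothed_dist(counts, V, beta=0.01):
    """Turn a word-count dict into a smoothed distribution over 0..V-1."""
    total = sum(counts.values()) + V * beta
    return [(counts.get(w, 0) + beta) / total for w in range(V)]

def kl_heatmap(counts_a, counts_b, V):
    """Rows: clusters at lag a; columns: clusters at lag b.
    A permuted (rather than diagonal) pattern of small entries
    indicates label switching between the two points in the chain."""
    dists_a = [smoothed_dist(c, V) for c in counts_a]
    dists_b = [smoothed_dist(c, V) for c in counts_b]
    return [[kl(p, q) for q in dists_b] for p in dists_a]
```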

- Scalability:
- Reuters Full: can we scale?
- Reuters All: can we really scale?

We’re doing a big cross-corpus, multi-algorithm, multi-metric text clustering survey, to answer the question: which clustering algorithm

- scores high on all/most metrics,
- works well on many data sets, and
- runs quickly and in reasonable space?

- Enumerate and describe all known clustering metrics; in particular, look into:
- Cluster edit distance
- Compactness
- BIC?
- V-metric
- Q0, Q2
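Since the V-measure shows up in both lists, here is a from-scratch sketch of Rosenberg & Hirschberg's definition: the harmonic mean of homogeneity and completeness, each defined via conditional entropy. Function names are ours:

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) of a label sequence, in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(xs, ys):
    """H(X | Y) from paired label sequences."""
    n = len(xs)
    by_y = Counter(ys)
    joint = Counter(zip(xs, ys))
    return -sum(c / n * math.log(c / by_y[y]) for (x, y), c in joint.items())

def v_measure(classes, clusters):
    """V = harmonic mean of homogeneity and completeness."""
    hC, hK = entropy(classes), entropy(clusters)
    hom = 1.0 if hC == 0 else 1.0 - cond_entropy(classes, clusters) / hC
    com = 1.0 if hK == 0 else 1.0 - cond_entropy(clusters, classes) / hK
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
```

A perfect clustering scores 1 regardless of how the cluster labels are permuted, which is exactly the property we want from an external metric here.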

- Multidimensional scaling

- Look at new del.icio.us dataset crawled by Michael
- Check PennTags and LibraryThing.com for more labeled data.
- Also for more labeled data: Library of Congress Subject Labels
- 20 Newsgroups at \\entropy\data
- Think about how to use Wikipedia data for clustering and classification

- XX Debugging EM:
- XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
- XX Test the assumption that a seeded Random object will produce the same stream.
- XX Verify your log distributions everywhere.
- XX Verify that documents are getting loaded in the same order.
- Initialize model parameters, NOT posteriors (?)

- More efficient datum representation as vectors of counts.
- Write Vector class (done 9/22)
- Replace Labeled Datums with vectors in clusterer tester, clusterers

- Rewrite code for explicit E-step/M-step division (95% done; remaining work blocked on a bug)
- Try something other than Laplace smoothing, such as Good-Turing, to smooth the multinomial NB model.
- Implement K-means
- Easy K-means using TF-IDF vectors.
- K-means using “hard” version of EM and multinomial count vectors.
- Bisecting K-means
- Read about constrained K-means (Wagstaff & Klein)
- Read about Fast K-means using the Triangle inequality, UCSD
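The "hard EM" K-means variant above might be sketched as follows. The deterministic modulo initialization, the smoothing constant, and all names are our assumptions, not the planned implementation:

```python
import math
from collections import Counter

def hard_em_kmeans(docs, K, V, beta=0.01, iters=20):
    """K-means as 'hard' EM on a mixture of multinomials: assign each
    doc to its argmax-likelihood cluster, then re-estimate smoothed
    multinomial parameters. docs: list of Counter(word_id -> count)."""
    z = [d % K for d in range(len(docs))]  # simple deterministic init
    for _ in range(iters):
        # M-step: Lidstone-smoothed multinomial per cluster
        logp = []
        for k in range(K):
            counts = Counter()
            for d, doc in enumerate(docs):
                if z[d] == k:
                    counts.update(doc)
            total = sum(counts.values()) + V * beta
            logp.append([math.log((counts.get(w, 0) + beta) / total)
                         for w in range(V)])
        # E-step (hard): winner-take-all assignment
        z = [max(range(K),
                 key=lambda k: sum(c * logp[k][w] for w, c in doc.items()))
             for doc in docs]
    return z
```

Compared to the TF-IDF variant, this keeps everything in the multinomial model family, so its objective is directly comparable to soft EM's.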


- Assimilate all of Michael Goulding's changes in appropriate way
- From Michael:
- Adjusted Rand index = -28752.55 / 1508641.19 = -0.019059

- Check out NIPS Workshop on “Learning to Compare Examples”
- Read Meila and Heckerman paper. (link from CS 598R reading list).
- Estimate log with a Taylor-series expansion for quick evaluation – ask Hal
- Augment wiki page on “Log domain calculations” with this info.
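For the "Log domain calculations" page: the standard stable way to add probabilities in the log domain factors out the max; the Taylor-series idea in the item above amounts to approximating the remaining `log(1 + e^x)`-style correction term with a truncated series or lookup table. A sketch:

```python
import math

def log_add(log_xs):
    """Return log(sum(exp(x) for x in log_xs)) without underflow,
    by factoring out the maximum term."""
    m = max(log_xs)
    if m == float("-inf"):   # every term is a zero probability
        return m
    return m + math.log(sum(math.exp(x - m) for x in log_xs))
```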

- Read Russell and Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
- Contact Teg to get access to his EM + Gibbs Sampler code as another clusterer to try on our data.
- Read Fast Clustering Papers and supplement bibliography
- Read Bekkerman & McCallum paper from 2005 ICML on co-clustering
- Get Griffiths & Steyvers LDA code – using Gibbs Sampling
- Read Radford Neal Technical Report “Markov chain sampling methods for Dirichlet process mixture models” from 1998
- Investigate multi-view clustering (like multi-view classification) – ref.: Sanjoy Dasgupta and Inderjit Dhillon.
- “Product of Experts” – ref.: Geoff Hinton
- Look into prior work on using graph cuts and flows to cluster documents

- Propose an alternate version of EM, inspired by Gauss-Seidel: every time you change a P(w|c), update the model parameters. For 611: prove convergence. Propose how to parallelize.
- Cheeseman and Stutz's approximation to the marginal log likelihood of the data
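The Gauss-Seidel-inspired EM idea above (fold each document's new responsibilities back into the parameters immediately, rather than after a full E-step) is close in spirit to incremental EM. A sketch under our own assumptions about data layout and initialization:

```python
import math
import random

def incremental_em(docs, K, V, beta=0.01, sweeps=10, seed=0):
    """EM for a mixture of multinomials with Gauss-Seidel-style updates:
    re-estimate one document's responsibilities at a time and update the
    sufficient statistics immediately. docs: list of {word_id: count}."""
    rng = random.Random(seed)
    n = len(docs)
    lengths = [sum(doc.values()) for doc in docs]
    r = []                               # responsibilities r[d][k]
    for _ in docs:
        row = [rng.random() for _ in range(K)]
        s = sum(row)
        r.append([x / s for x in row])
    mk = [0.0] * K                       # expected cluster sizes
    nkw = [[0.0] * V for _ in range(K)]  # expected word counts
    nk = [0.0] * K                       # expected token totals

    def fold(d, sign):
        """Add (+1) or remove (-1) doc d's contribution to the stats."""
        for k in range(K):
            mk[k] += sign * r[d][k]
            nk[k] += sign * r[d][k] * lengths[d]
            for w, c in docs[d].items():
                nkw[k][w] += sign * r[d][k] * c

    for d in range(n):
        fold(d, +1)
    for _ in range(sweeps):
        for d in range(n):
            fold(d, -1)                  # drop doc d's old contribution
            logp = []
            for k in range(K):           # E-step for doc d, current params
                lp = math.log((mk[k] + 1.0) / (n - 1 + K))
                for w, c in docs[d].items():
                    lp += c * math.log((nkw[k][w] + beta) / (nk[k] + V * beta))
                logp.append(lp)
            m = max(logp)
            ex = [math.exp(x - m) for x in logp]
            s = sum(ex)
            r[d] = [x / s for x in ex]
            fold(d, +1)                  # parameters shift immediately
    return r
```

Parallelizing is the open question the item raises: each update reads statistics the previous update just wrote, which is exactly the serial dependency Gauss-Seidel has.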

- select best clustering algorithm for keyphrase extraction from news