= Papers =

* 1. Gibbs Sampling on Mixture of Multinomials (a sampler sketch follows this item)
** EM alone
** CEM
** k-means (not model-based)
** GS
** GS + EM starting from the MAP likelihood sample
** Other things to try:
**# Send email queries to figure out how to normalize the likelihood and to get Minimum Bayes Risk worked out, so that we can add another summary technique.
**# Don't plot all metrics on the same scale; differences may be more visible when zoomed in.
**# Work on the test for significance (related to #2).
**# See to what degree these results hold up on the other data sets.
**# See to what degree these results hold up on smaller splits (approx. 5000 docs max.), possibly with longer chains.
**# Validate with the HBC implementation of Gibbs sampling.
** Worth doing, perhaps for a subsequent paper:
**# Compare with Gibbs + EM.
**# Compare with Variational EM using HBC.
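For concreteness, here is a minimal sketch of a collapsed Gibbs sampler for the mixture of multinomials, assuming symmetric Dirichlet priors (ALPHA on the cluster proportions, BETA on the per-cluster word distributions). The class and field names are illustrative, not from our codebase; treat it as a reference for the GS runs above, not as the implementation.

<pre>
import java.util.Random;

/** Sketch: collapsed Gibbs sampling for a mixture of multinomials. */
public class MultinomialMixtureGibbs {
    static final double ALPHA = 1.0, BETA = 0.01; // assumed symmetric priors
    final int K, V;               // clusters, vocabulary size
    final int[][] docCounts;      // docCounts[d][v] = count of word v in doc d
    final int[] z;                // z[d] = current cluster of doc d
    final int[] clusterDocs;      // docs currently in each cluster
    final int[][] clusterWord;    // word counts accumulated per cluster
    final int[] clusterTokens;    // total tokens per cluster
    final Random rng = new Random(42); // seeded so runs are repeatable

    MultinomialMixtureGibbs(int[][] docCounts, int K, int V) {
        this.docCounts = docCounts; this.K = K; this.V = V;
        z = new int[docCounts.length];
        clusterDocs = new int[K]; clusterWord = new int[K][V]; clusterTokens = new int[K];
        for (int d = 0; d < docCounts.length; d++) add(d, rng.nextInt(K));
    }

    void add(int d, int k) {
        z[d] = k; clusterDocs[k]++;
        for (int v = 0; v < V; v++) { clusterWord[k][v] += docCounts[d][v]; clusterTokens[k] += docCounts[d][v]; }
    }

    void remove(int d) {
        int k = z[d]; clusterDocs[k]--;
        for (int v = 0; v < V; v++) { clusterWord[k][v] -= docCounts[d][v]; clusterTokens[k] -= docCounts[d][v]; }
    }

    /** One sweep: resample each document's cluster given all the others. */
    void sweep() {
        for (int d = 0; d < docCounts.length; d++) {
            remove(d);
            double[] logP = new double[K];
            for (int k = 0; k < K; k++) {
                // (docs in k + ALPHA) times the Dirichlet-multinomial
                // predictive probability of doc d's tokens under cluster k
                double lp = Math.log(clusterDocs[k] + ALPHA);
                int i = 0;
                for (int v = 0; v < V; v++)
                    for (int j = 0; j < docCounts[d][v]; j++)
                        lp += Math.log((clusterWord[k][v] + BETA + j)
                                     / (clusterTokens[k] + V * BETA + i++));
                logP[k] = lp;
            }
            add(d, sampleLog(logP));
        }
    }

    /** Draw an index from unnormalized log probabilities (log-sum-exp). */
    int sampleLog(double[] logP) {
        double max = Double.NEGATIVE_INFINITY;
        for (double lp : logP) max = Math.max(max, lp);
        double sum = 0; double[] p = new double[logP.length];
        for (int k = 0; k < p.length; k++) { p[k] = Math.exp(logP[k] - max); sum += p[k]; }
        double u = rng.nextDouble() * sum;
        for (int k = 0; k < p.length; k++) { u -= p[k]; if (u <= 0) return k; }
        return p.length - 1;
    }
}
</pre>

The seeded Random mirrors the reproducibility check in the Coding Tasks below, and tallying z[d] across post-burn-in sweeps yields the per-document marginal posteriors used in the summaries under Current Experiments.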
* 2. Metric Survey, Algorithm Survey (see below)
** Comparing clustering metrics and algorithms
** Correlations among metrics (a correlation sketch follows this item)
** Across many data sets
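A possible starting point for the correlation study, as a hypothetical sketch: treat each metric as a vector of scores over the same set of clustering runs (algorithm and data-set pairs), and compute all pairwise Pearson correlations. The array layout and names are assumptions for illustration.

<pre>
/** Sketch: pairwise correlations among clustering metrics. */
public class MetricCorrelation {
    /** Pearson correlation between two equal-length score vectors. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - mx, dy = y[i] - my;
            sxy += dx * dy; sxx += dx * dx; syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** scores[m][r] = metric m evaluated on clustering run r. */
    static double[][] correlationMatrix(double[][] scores) {
        int m = scores.length;
        double[][] corr = new double[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                corr[i][j] = pearson(scores[i], scores[j]);
        return corr;
    }
}
</pre>

If the metrics live on very different scales, Spearman rank correlation may be more appropriate; it is the same computation applied to the ranks of the scores.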
* 3. Publicizing del.icio.us as a new data set
** web data
** human-labeled
** plentiful
* 4. Pick the best/fastest clustering algorithm
** experiments on desktop PC data
** experiments on a "virtual user"
*** 1 Enron user's email
*** 20 Newsgroups specific to that user's interests in the timeframe of the Enron corpus (harvest Google Groups or a Usenet corpus)
*** del.icio.us: use the user's interests to crawl web data
*** IM? PDFs? .doc, .ppt, .xls?

== Current Experiments ==

* Reuters Reduced: goal is to assess clustering metrics in comparison to results from super-reduced.
* XX On Reuters super-reduced: find the mode of the joint posterior and compare it with the modes of the per-document marginal posteriors.
** In one run consisting of 11,000 samples there were no duplicate samples from the joint.
* Pick k using EM and a held-out data set.
* Implement Andrew Rosenberg's V-measure, from his EMNLP paper (a sketch appears after the Coding Tasks list below).
* Implement BIC?
* Summarize clusters from Gibbs sampling with the following techniques (use error bars to reflect confidence intervals around metrics):
** modes of per-document marginal posteriors
** last sample
** MAP sample
** random sample
* Identifiability / label-switching drift map
** diagnose using a cluster-vs.-cluster KL-divergence heat map
** see Steyvers & Griffiths 2006, Fig. 8, for the idea
** try at different lags within the '''same''' run
* Scalability:
** Reuters Full: can we scale?
** Reuters All: can we really scale?

= Cluster Survey =

We're doing a big cross-corpus, multi-algorithm, multi-metric text clustering survey, to answer the following: ''Which clustering algorithm has the following properties?''
* scores high on all/most metrics
* works well on many data sets
* runs quickly and in reasonable space

== Metrics ==

=== Metric List ===
* Enumerate and describe all known clustering metrics. In particular, look into:
** cluster edit distance
** compactness
** BIC?
** V-measure
** Q0, Q2

=== Comparing Metrics ===
* Multidimensional scaling

== Possible Data Sets ==
* Look at the new del.icio.us dataset crawled by Michael.
* Check PennTags and LibraryThing.com for more labeled data.
* Also for more labeled data: Library of Congress Subject Headings.
* 20 Newsgroups at \\entropy\data
* Think about how to use Wikipedia data for clustering and classification.

== Coding Tasks ==
* XX Debugging EM:
** XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
** XX Test the assumption that a seeded Random object will produce the same stream.
** XX Verify the log distributions everywhere.
** XX Verify that documents are getting loaded in the same order.
** Initialize model parameters and NOT posteriors (?)
* More efficient datum representation as vectors of counts.
** Write Vector class (done 9/22).
** Replace LabeledDatums with vectors in the clusterer tester and the clusterers.
* Rewrite code for an explicit E-step/M-step division (95% done, pending a bug fix).
* Try something other than Laplace smoothing, such as Good-Turing, to smooth the multinomial NB model.
* Implement K-means:
** easy K-means using TF-IDF vectors
** K-means using a "hard" version of EM and multinomial count vectors (see the sketch below)
** bisecting K-means
** read about constrained K-means (Wagstaff & Klein)
** read about fast K-means using the triangle inequality (UCSD)
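For the V-measure task under Current Experiments, here is a minimal sketch computed from a class-by-cluster confusion matrix, following the definitions in Rosenberg & Hirschberg's EMNLP paper: homogeneity, completeness, and their harmonic mean. The method and matrix names are illustrative, not from our codebase.

<pre>
/** Sketch: V-measure from a confusion matrix a[c][k] = number of
 *  documents with gold class c assigned to cluster k. */
public class VMeasure {
    static double vMeasure(int[][] a) {
        int C = a.length, K = a[0].length;
        double n = 0;
        double[] nc = new double[C], nk = new double[K];
        for (int i = 0; i < C; i++)
            for (int j = 0; j < K; j++) { nc[i] += a[i][j]; nk[j] += a[i][j]; n += a[i][j]; }
        // entropies of classes, clusters, and the two conditionals
        double hC = 0, hK = 0, hCgivenK = 0, hKgivenC = 0;
        for (int i = 0; i < C; i++) if (nc[i] > 0) hC -= nc[i] / n * Math.log(nc[i] / n);
        for (int j = 0; j < K; j++) if (nk[j] > 0) hK -= nk[j] / n * Math.log(nk[j] / n);
        for (int i = 0; i < C; i++)
            for (int j = 0; j < K; j++)
                if (a[i][j] > 0) {
                    hCgivenK -= a[i][j] / n * Math.log(a[i][j] / nk[j]);
                    hKgivenC -= a[i][j] / n * Math.log(a[i][j] / nc[i]);
                }
        double h = (hC == 0) ? 1 : 1 - hCgivenK / hC;   // homogeneity
        double c = (hK == 0) ? 1 : 1 - hKgivenC / hK;   // completeness
        return (h + c == 0) ? 0 : 2 * h * c / (h + c);  // harmonic mean
    }
}
</pre>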
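And for the "hard EM" K-means variant in the list above, a minimal sketch under assumed defaults: random initialization, a uniform cluster prior, add-one smoothing, and a fixed iteration count. Nothing here is tied to our existing classes.

<pre>
import java.util.Random;

/** Sketch: "hard EM" K-means over multinomial count vectors. */
public class HardEmKMeans {
    /** counts[d][v] = count of word v in doc d; returns each doc's cluster. */
    static int[] cluster(int[][] counts, int K, int iters, long seed) {
        int n = counts.length, V = counts[0].length;
        Random rng = new Random(seed);
        int[] z = new int[n];
        for (int d = 0; d < n; d++) z[d] = rng.nextInt(K);  // random init
        double[][] logProb = new double[K][V];
        for (int it = 0; it < iters; it++) {
            // M-step: re-estimate p(w|c) with add-one (Laplace) smoothing
            for (int k = 0; k < K; k++) {
                double[] sum = new double[V];
                double total = 0;
                for (int d = 0; d < n; d++)
                    if (z[d] == k)
                        for (int v = 0; v < V; v++) { sum[v] += counts[d][v]; total += counts[d][v]; }
                for (int v = 0; v < V; v++)
                    logProb[k][v] = Math.log((sum[v] + 1.0) / (total + V));
            }
            // Hard E-step: reassign each doc to its max-likelihood cluster
            // (uniform cluster prior assumed, so only the likelihood matters)
            for (int d = 0; d < n; d++) {
                int best = 0;
                double bestLp = Double.NEGATIVE_INFINITY;
                for (int k = 0; k < K; k++) {
                    double lp = 0;
                    for (int v = 0; v < V; v++) lp += counts[d][v] * logProb[k][v];
                    if (lp > bestLp) { bestLp = lp; best = k; }
                }
                z[d] = best;
            }
        }
        return z;
    }
}
</pre>

Swapping the add-one smoothing for Good-Turing, per the task above, only changes the M-step.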
* Assimilate all of Michael Goulding's changes in an appropriate way.
* From Michael:
** Adjusted Rand Index = -28752.55 / 1508641.19 = -0.019059

== Resources to Investigate ==
* Check out the NIPS Workshop on "Learning to Compare Examples".
* Read the Meila and Heckerman paper (link from the CS 598R reading list).
* Estimate log with a Taylor-series expansion for quick evaluation; ask Hal.
** Augment the wiki page on "Log domain calculations" with this info.
* Read the Russell and Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
* Contact Teg to get access to his EM + Gibbs sampler code as another clusterer to try on our data.
* Read the fast clustering papers and supplement the bibliography.
* Read the Bekkerman & McCallum paper from ICML 2005 on co-clustering.
* Get Griffiths & Steyvers' LDA code, which uses Gibbs sampling.
* Read Radford Neal's 1998 technical report, "Markov chain sampling methods for Dirichlet process mixture models".
* Investigate multi-view clustering (like multi-view classification); refs: Sanjoy Dasgupta and Inderjit Dhillon.
* "Product of Experts"; ref.: Geoff Hinton.
* Look into prior work on using graph cuts and flows to cluster documents.

== Ideas to Explore ==
* Propose an alternate version of EM, inspired by Gauss-Seidel: every time you change a P(w|c), update the model parameters. For 611: prove convergence; propose how to parallelize. (A rough sketch appears at the bottom of this page.)
* Cheeseman and Stutz's approximation to the marginal log likelihood of the data.

= Newsreader =
* Select the best clustering algorithm for keyphrase extraction from news.
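One possible reading of the Gauss-Seidel-inspired EM idea above (forward-referenced from Ideas to Explore): keep expected sufficient statistics for a multinomial mixture and, after recomputing any one document's posterior, immediately fold its new responsibilities back in so the very next document sees updated parameters, much like incremental EM in the style of Neal & Hinton. This is a sketch under assumed names and smoothing; proving convergence for our setting is exactly the 611 exercise.

<pre>
import java.util.Random;

/** Sketch: incremental ("Gauss-Seidel") EM for a multinomial mixture. */
public class IncrementalEm {
    final int[][] counts;   // counts[d][v] = count of word v in doc d
    final int[] docLen;     // total tokens per doc
    final int K, V, n;
    final double[][] resp;  // resp[d][k] = current posterior P(c=k | doc d)
    final double[] nk;      // expected number of docs per cluster
    final double[][] nkv;   // expected word counts per cluster
    final double[] nkTok;   // expected total tokens per cluster

    IncrementalEm(int[][] counts, int K, int V, long seed) {
        this.counts = counts; this.K = K; this.V = V; n = counts.length;
        docLen = new int[n]; resp = new double[n][K];
        nk = new double[K]; nkv = new double[K][V]; nkTok = new double[K];
        Random rng = new Random(seed);
        for (int d = 0; d < n; d++) {
            for (int v = 0; v < V; v++) docLen[d] += counts[d][v];
            resp[d][rng.nextInt(K)] = 1.0;     // hard random initialization
            fold(d, +1);
        }
    }

    /** Add (sign = +1) or remove (sign = -1) doc d's expected contribution. */
    void fold(int d, int sign) {
        for (int k = 0; k < K; k++) {
            nk[k] += sign * resp[d][k];
            nkTok[k] += sign * resp[d][k] * docLen[d];
            for (int v = 0; v < V; v++) nkv[k][v] += sign * resp[d][k] * counts[d][v];
        }
    }

    /** Per-document E-step whose result is folded back in immediately. */
    void updateDoc(int d) {
        fold(d, -1);                           // Gauss-Seidel: drop old stats
        double[] lp = new double[K];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < K; k++) {
            lp[k] = Math.log((nk[k] + 1.0) / (n - 1 + K));   // smoothed prior
            for (int v = 0; v < V; v++)
                if (counts[d][v] > 0)
                    lp[k] += counts[d][v] * Math.log((nkv[k][v] + 1.0) / (nkTok[k] + V));
            max = Math.max(max, lp[k]);
        }
        double sum = 0;                        // normalize in the log domain
        for (int k = 0; k < K; k++) { resp[d][k] = Math.exp(lp[k] - max); sum += resp[d][k]; }
        for (int k = 0; k < K; k++) resp[d][k] /= sum;
        fold(d, +1);                           // parameters change right away
    }

    void sweep() { for (int d = 0; d < n; d++) updateDoc(d); }
}
</pre>

Parallelization is the open question: unlike batch EM, the statistics here are shared mutable state, so concurrent updates would need atomic counters or mini-batch application.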