= Papers =
* 1. Gibbs Sampling on Mixture of Multinomials
** EM alone
** CEM
** k-means (not model-based)
** GS (Gibbs sampling; a sampler sketch appears below these lists)
** GS + EM, with EM started from the highest-likelihood (MAP) sample
** Other things to try:
**# send email queries to figure out how to normalize the likelihood and to work out Minimum Bayes Risk, so that we can add another summary technique.
**# don't plot all metrics on the same scale; differences may be more visible when zoomed in.
**# work on the test for significance (related to #2)
**# see to what degree these results hold up on the other data sets
**# see to what degree these results hold up on smaller splits (approx. 5000 docs max.), possibly with longer chains
**# validate with HBC implementation of Gibbs sampling
** Worth doing, perhaps for a subsequent paper:
**# compare with Gibbs + EM
**# compare with variational EM using HBC
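For reference, a minimal Python sketch of collapsed Gibbs sampling for the mixture-of-multinomials model, not the project code: it assumes a docs-by-vocabulary count matrix <code>X</code>, symmetric Dirichlet priors <code>alpha</code> (cluster proportions) and <code>beta</code> (cluster word distributions), and placeholder settings throughout.

<pre>
import numpy as np
from scipy.special import gammaln

def gibbs_mixture_multinomial(X, K, n_sweeps=200, alpha=1.0, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet-multinomial mixture (sketch)."""
    rng = np.random.RandomState(seed)
    D, V = X.shape
    doc_len = X.sum(axis=1)
    z = rng.randint(K, size=D)                      # cluster assignment per document
    n_k = np.bincount(z, minlength=K).astype(float) # documents per cluster
    c_kw = np.zeros((K, V))                         # word counts per cluster
    for d in range(D):
        c_kw[z[d]] += X[d]
    N_k = c_kw.sum(axis=1)                          # total words per cluster
    samples = []
    for sweep in range(n_sweeps):
        for d in range(D):
            # Remove document d from its current cluster.
            k_old = z[d]
            n_k[k_old] -= 1; c_kw[k_old] -= X[d]; N_k[k_old] -= doc_len[d]
            # log p(z_d = k | z_-d, X): cluster-size term plus the
            # Dirichlet-multinomial predictive probability of doc d's counts.
            logp = (np.log(n_k + alpha)
                    + gammaln(N_k + V * beta)
                    - gammaln(N_k + doc_len[d] + V * beta)
                    + (gammaln(c_kw + X[d] + beta) - gammaln(c_kw + beta)).sum(axis=1))
            logp -= logp.max()
            p = np.exp(logp); p /= p.sum()
            k_new = rng.choice(K, p=p)
            # Add document d back under its new assignment.
            z[d] = k_new
            n_k[k_new] += 1; c_kw[k_new] += X[d]; N_k[k_new] += doc_len[d]
        samples.append(z.copy())                    # thin / discard burn-in as needed
    return samples
</pre>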
* 2. Metric Survey, Algorithm Survey (see below)
** Comparing clustering metrics and algorithms
** Correlations among metrics
** Across many data sets
* 3. Publicizing del.icio.us as new data set
** web data
** human labeled
** plentiful
* 4. Pick best/fastest clustering algorithm
** experiments on desktop PC data
** experiments on "virtual user"
*** 1 Enron user's email
*** 20 Newsgroups specific to that user's interests in the timeframe of the Enron corpus (harvest Google Groups or a Usenet corpus)
*** Del.icio.us - use the user's interests to crawl web data
*** IM? PDFs? .doc, .ppt, .xls?
== Current Experiments ==
* Reuters Reduced: goal is to assess clustering metrics in comparison with the results from super-reduced
* XX On Reuters super-reduced: find the mode of the joint posterior and compare with the modes of the per-document marginal posteriors
** In one run consisting of 11,000 samples there were no duplicate samples from the joint.
* Pick k using EM and a held-out data set.
* Implement Andrew Rosenberg's V-measure, based on his EMNLP paper.
* Implement BIC?
* Summarize clusters from Gibbs sampling with the following techniques (use error bars to reflect confidence intervals around the metrics; a summary sketch appears after this list):
** modes of per-document marginal posteriors
** last sample
** MAP sample
** random sample
* Identifiability / Label-Switching Drift Map
** diagnose using a cluster-vs.-cluster KL-divergence heat map (a sketch appears after this list)
** see Steyvers & Griffiths 2006 Fig. 8 for idea
** try at different lags within the '''same''' run.
* Scalability:
** Reuters Full: can we scale?
** Reuters All: can we really scale?
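A Python sketch of the four summary techniques listed above. It assumes <code>samples</code> is the list of stored assignment vectors from a Gibbs run (as in the sampler sketch in the Papers section) and that a <code>log_joint(z, X)</code> scoring function is available; both are assumptions, not existing project code. Metrics computed on each summary across several chains would then carry the error bars mentioned above.

<pre>
import numpy as np

def summarize_gibbs_run(samples, log_joint, X, K, rng=np.random):
    """Return the four cluster summaries of a single Gibbs run (sketch)."""
    S = np.array(samples)                        # shape: (n_samples, n_docs)
    # 1. Mode of each document's marginal posterior (most frequent label per doc).
    #    Only meaningful if label switching is rare within the run.
    marginal_modes = np.array([np.bincount(S[:, d], minlength=K).argmax()
                               for d in range(S.shape[1])])
    # 2. Last sample of the chain.
    last_sample = S[-1]
    # 3. Sample with the highest log joint ("MAP sample").
    scores = np.array([log_joint(z, X) for z in S])
    map_sample = S[scores.argmax()]
    # 4. A sample drawn at random from the chain.
    random_sample = S[rng.randint(len(S))]
    return marginal_modes, last_sample, map_sample, random_sample
</pre>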
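A Python sketch of the label-switching heat map: it compares the smoothed per-cluster word distributions implied by two samples taken a given number of sweeps apart in the same run, using symmetrized KL divergence; the smoothing constant and helper names are assumptions.

<pre>
import numpy as np

def cluster_word_dists(z, X, K, beta=0.01):
    """Smoothed per-cluster word distributions implied by one assignment sample."""
    counts = np.zeros((K, X.shape[1]))
    for k in range(K):
        counts[k] = X[z == k].sum(axis=0)
    counts += beta
    return counts / counts.sum(axis=1, keepdims=True)

def symmetric_kl(p, q):
    return 0.5 * (np.sum(p * (np.log(p) - np.log(q)))
                  + np.sum(q * (np.log(q) - np.log(p))))

def label_switch_heatmap(samples, X, K, lag, beta=0.01):
    """K x K symmetrized KL divergences between cluster word distributions
    at the start of the stored chain and lag sweeps later (cf. Steyvers &
    Griffiths 2006, Fig. 8).  Small off-diagonal entries suggest that labels
    have drifted or switched between the two points."""
    P = cluster_word_dists(samples[0], X, K, beta)
    Q = cluster_word_dists(samples[lag], X, K, beta)
    H = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            H[i, j] = symmetric_kl(P[i], Q[j])
    return H    # plot with, e.g., matplotlib's imshow
</pre>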
= Cluster Survey =
We’re doing a large cross-corpus, multi-algorithm, multi-metric text-clustering survey to answer the following question:
''Which clustering algorithm has the following properties?''
* scores high on all/most metrics
* works well on many data sets
* runs quickly and in reasonable space
== Metrics ==
=== Metric List ===
* Enumerate and describe all known clustering metrics; in particular, look into:
** Cluster edit distance
** Compactness
** BIC?
** V-metric (Rosenberg's V-measure; a sketch appears after this list)
** Q0, Q2
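A Python sketch of the V-measure (the harmonic mean of homogeneity and completeness), computed from the class-by-cluster contingency table; it assumes integer gold labels and cluster assignments and follows the usual formulation rather than any project code.

<pre>
import numpy as np

def v_measure(labels, clusters, beta=1.0):
    """V-measure from a class-by-cluster contingency table (sketch)."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    cont = np.zeros((labels.max() + 1, clusters.max() + 1))
    for y, c in zip(labels, clusters):
        cont[y, c] += 1.0
    p = cont / cont.sum()                       # joint distribution p(class, cluster)
    p_class, p_clust = p.sum(axis=1), p.sum(axis=0)

    def entropy(q):
        q = q[q > 0]
        return -np.sum(q * np.log(q))

    nz = p > 0
    pk = np.broadcast_to(p_clust, p.shape)
    pc = np.broadcast_to(p_class[:, None], p.shape)
    h_c_given_k = -np.sum(p[nz] * (np.log(p[nz]) - np.log(pk[nz])))   # H(class | cluster)
    h_k_given_c = -np.sum(p[nz] * (np.log(p[nz]) - np.log(pc[nz])))   # H(cluster | class)
    h_c, h_k = entropy(p_class), entropy(p_clust)
    homogeneity = 1.0 if h_c == 0 else 1.0 - h_c_given_k / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - h_k_given_c / h_k
    denom = beta * homogeneity + completeness
    return 0.0 if denom == 0 else (1 + beta) * homogeneity * completeness / denom
</pre>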
=== Comparing Metrics ===
* Multidimensional scaling of the metrics (see the sketch below)
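One possible way to realize this, assuming a runs-by-metrics score matrix: treat 1 - |correlation| between metrics as a dissimilarity and embed the metrics in 2-D with MDS (scikit-learn's implementation is assumed here); metrics that rank clusterings similarly should land near each other.

<pre>
import numpy as np
from sklearn.manifold import MDS

def embed_metrics(scores, metric_names):
    """scores: (n_runs, n_metrics) array, one row of metric values per
    clustering run.  Returns a 2-D coordinate for each metric (sketch)."""
    corr = np.corrcoef(scores.T)                # metric-by-metric correlation
    dissim = 1.0 - np.abs(corr)                 # strongly correlated metrics are "close"
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dissim)
    return dict(zip(metric_names, coords))
</pre>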
== Possible Data Sets ==
* Look at new del.icio.us dataset crawled by Michael
* Check PennTags and LibraryThing.com for more labeled data.
* Also for more labeled data: Library of Congress Subject Headings
* 20 Newsgroups at \\entropy\data
* Think about how to use Wikipedia data for clustering and classification
== Coding Tasks ==
* XX Debugging EM:
** XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
** XX Test the assumption that a seeded Random object will produce the same stream.
** XX Verify your log distributions everywhere.
** XX Verify that documents are getting loaded in the same order.
** Initialize model parameters and NOT posteriors (?)
* More efficient datum representation as vectors of counts.
** Write Vector class (done 9/22)
** Replace Labeled Datums with count vectors in the clusterer tester and the clusterers
* Rewrite code for an explicit E-step/M-step division (95% done; blocked on a bug)
* Try something other than Laplace smoothing, such as Good-Turing, to smooth the multinomial NB model.
* Implement K-means
** Easy K-means using TF-IDF vectors (a sketch appears at the end of this section).
** K-means using "hard" version of EM and multinomial count vectors.
** Bisecting K-means
** Read about constrained K-means (Wagstaff & Klein)
** Read about fast K-means using the triangle inequality (Elkan, UCSD)
* Assimilate all of Michael Goulding's changes in an appropriate way
* From Michael:
** Adjusted Rand index = -28752.55 / 1508641.19 = -0.019059
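A Python sketch of the "easy K-means using TF-IDF vectors" item: L2-normalized TF-IDF rows clustered by cosine similarity (spherical K-means). The count matrix, K, and the iteration count are placeholders, not project settings.

<pre>
import numpy as np

def tfidf(X):
    """Turn a (docs x vocab) raw count matrix into L2-normalized TF-IDF rows."""
    df = (X > 0).sum(axis=0)
    idf = np.log(X.shape[0] / np.maximum(df, 1.0))
    W = X * idf
    return W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)

def kmeans_cosine(X, K, n_iters=50, seed=0):
    """Assign each document to the most similar centroid by cosine similarity,
    then recompute and renormalize the centroids."""
    rng = np.random.RandomState(seed)
    V = tfidf(X)
    centroids = V[rng.choice(len(V), K, replace=False)]
    assign = np.zeros(len(V), dtype=int)
    for _ in range(n_iters):
        sims = V @ centroids.T                  # rows are unit length, so this is cosine
        assign = sims.argmax(axis=1)
        new = np.zeros_like(centroids)
        for k in range(K):
            members = V[assign == k]
            # Re-seed empty clusters from a random document.
            new[k] = members.mean(axis=0) if len(members) else V[rng.randint(len(V))]
        centroids = new / np.maximum(np.linalg.norm(new, axis=1, keepdims=True), 1e-12)
    return assign, centroids
</pre>

The "hard EM with multinomial count vectors" variant in the list would replace the cosine step with the multinomial log-likelihood of each document's counts under each cluster.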
== Resources to Investigate ==
* Check out NIPS Workshop on "Learning to Compare Examples"
* Read the Meila and Heckerman paper (link from the CS 598R reading list).
* Estimate log with a Taylor-series expansion for quick evaluation -- ask Hal (a log-domain sketch appears at the end of this list)
** Augment wiki page on "Log domain calculations" with this info.
* Read the Russell and Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
* Contact Teg to get access to his EM + Gibbs Sampler code as another clusterer to try on our data.
* Read Fast Clustering Papers and supplement bibliography
* Read the Bekkerman & McCallum paper from ICML 2005 on co-clustering
* Get Griffiths & Steyvers LDA code -- using Gibbs Sampling
* Read Radford Neal Technical Report "Markov chain sampling methods for Dirichlet process mixture models" from 1998
* Investigate Multi-view Clustering (like Multi-view Classification) -- ref.: Sanjoy Dasgupta and Inderjit Dhillon.
* "Product of Experts -- ref.: Geoff Hinton
* Look into prior work on using graph cuts and flows to cluster documents
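Related to the log-estimation item above, a Python sketch of numerically stable log-domain addition plus a truncated Taylor-series stand-in for log1p; the number of series terms is an arbitrary choice, and whether it is actually faster than the library log would have to be measured.

<pre>
import math

def log_add(a, b):
    """Return log(exp(a) + exp(b)) without leaving the log domain."""
    if a < b:
        a, b = b, a
    return a + math.log1p(math.exp(b - a))    # exp(b - a) <= 1, so this stays stable

def log1p_series(x, terms=4):
    """Truncated Taylor series log(1 + x) = x - x^2/2 + x^3/3 - ...;
    accurate only for small |x|, which is the regime log_add produces
    when one addend dominates."""
    total, power = 0.0, x
    for n in range(1, terms + 1):
        total += power / n if n % 2 == 1 else -power / n
        power *= x
    return total
</pre>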
== Ideas to Explore ==
* Propose an alternate version of EM, inspired by Gauss-Seidel iteration: every time you change a P(w|c), update the model parameters. For 611: prove convergence. Propose how to parallelize. (One possible reading is sketched after this list.)
* Cheeseman and Stutz's approximation to the marginal log-likelihood of the data (see the note after this list)
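One possible reading of the Gauss-Seidel-flavored EM idea above, in the spirit of Neal & Hinton's incremental EM (this interpretation is an assumption, not a statement of the intended algorithm): re-estimate one document's responsibilities at a time and fold them back into the sufficient statistics immediately, so the parameters are always current.

<pre>
import numpy as np

def incremental_em(X, K, n_sweeps=20, pseudo=0.01, seed=0):
    """Incremental EM for a mixture of multinomials (sketch): the model
    parameters are refreshed after every single document's E-step instead
    of after a full pass over the data."""
    rng = np.random.RandomState(seed)
    D, V = X.shape
    R = rng.dirichlet(np.ones(K), size=D)       # responsibilities, one row per doc
    doc_stats = R.sum(axis=0)                   # expected documents per cluster
    word_stats = R.T @ X                        # expected word counts per cluster
    for _ in range(n_sweeps):
        for d in range(D):
            # Remove document d's old contribution to the statistics.
            doc_stats -= R[d]
            word_stats -= np.outer(R[d], X[d])
            # Parameters implied by the remaining statistics (with smoothing).
            log_prior = np.log(doc_stats + pseudo)
            log_pwc = np.log(word_stats + pseudo)
            log_pwc -= np.log((word_stats + pseudo).sum(axis=1, keepdims=True))
            # E-step for this one document, folded straight back in.
            logp = log_prior + log_pwc @ X[d]
            logp -= logp.max()
            R[d] = np.exp(logp) / np.exp(logp).sum()
            doc_stats += R[d]
            word_stats += np.outer(R[d], X[d])
    return R
</pre>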
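For reference, the Cheeseman-Stutz approximation as it is usually stated (e.g., as used in AutoClass and analyzed by Chickering & Heckerman); this is from memory, not from the wiki:

<math>\log p(D \mid M) \;\approx\; \log p(\tilde{D} \mid M) \;+\; \log p(D \mid \hat{\theta}, M) \;-\; \log p(\tilde{D} \mid \hat{\theta}, M)</math>

where <math>\tilde{D}</math> is the data set completed with the expected sufficient statistics at the parameter estimate <math>\hat{\theta}</math>; the first term has a closed form under conjugate Dirichlet priors, and the last two fall out of the EM run.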
= Newsreader =
* select best clustering algorithm for keyphrase extraction from news