Papers
Worth doing, perhaps for a subsequent paper:
6. Compare with Gibbs + EM
7. Compare with variational EM using HBC
Current Experiments
Reuters Reduced: goal is to assess clustering metrics against the results from the super-reduced set
XX On Reuters super-reduced: find mode of joint posterior, compare with modes of per-document marginal posteriors
Pick k using EM and a held-out data set.
Implement Rosenberg & Hirschberg's V-measure from their EMNLP 2007 paper (see the sketch below).
Implement BIC?
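For the V-measure item, a minimal sketch of the metric as I understand it from the paper: homogeneity and completeness defined via conditional entropies over the class/cluster contingency distribution, combined as a harmonic mean. Class and method names are placeholders, not from our codebase:

```java
// Minimal V-measure sketch: gold class labels and predicted cluster labels
// arrive as parallel int arrays in [0, nClasses) and [0, nClusters).
public class VMeasureSketch {
    public static double vMeasure(int[] classes, int[] clusters, int nClasses, int nClusters) {
        int n = classes.length;
        double[][] joint = new double[nClasses][nClusters];
        double[] pClass = new double[nClasses];
        double[] pCluster = new double[nClusters];
        for (int i = 0; i < n; i++) {
            joint[classes[i]][clusters[i]] += 1.0 / n;
            pClass[classes[i]] += 1.0 / n;
            pCluster[clusters[i]] += 1.0 / n;
        }
        // Conditional entropies H(C|K) and H(K|C) from the joint distribution.
        double hCgivenK = 0.0, hKgivenC = 0.0;
        for (int c = 0; c < nClasses; c++)
            for (int k = 0; k < nClusters; k++)
                if (joint[c][k] > 0) {
                    hCgivenK -= joint[c][k] * Math.log(joint[c][k] / pCluster[k]);
                    hKgivenC -= joint[c][k] * Math.log(joint[c][k] / pClass[c]);
                }
        double hC = entropy(pClass), hK = entropy(pCluster);
        double hom = (hC == 0.0) ? 1.0 : 1.0 - hCgivenK / hC;   // homogeneity
        double com = (hK == 0.0) ? 1.0 : 1.0 - hKgivenC / hK;   // completeness
        return (hom + com == 0.0) ? 0.0 : 2.0 * hom * com / (hom + com);
    }
    private static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) if (pi > 0) h -= pi * Math.log(pi);
        return h;
    }
}
```

This is the beta = 1 version; the paper also allows weighting homogeneity vs. completeness if we care more about one.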
Summarize clusters from Gibbs sampling with the following techniques (use error bars to reflect confidence intervals around the metrics):
Identifiability / Label-Switching Drift Map
diagnose using a cluster-vs-cluster KL-divergence heat map (see the sketch below)
see Steyvers & Griffiths 2006, Fig. 8, for the idea
try at different lags within the same run.
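A sketch of that heat-map diagnostic: fill a k-by-k matrix with symmetrized KL divergences between per-cluster word distributions taken from two points in the run; label switching shows up as mass moving off the (sorted) diagonal. Assumes the distributions are already smoothed so there are no zero entries; all names are placeholders:

```java
// Cluster-vs-cluster KL matrix for the label-switching diagnostic.
// sampleA and sampleB are rows of smoothed word multinomials, e.g. the
// same Gibbs run at two different lags.
public class LabelSwitchMap {
    // Symmetrized KL divergence between two smoothed multinomials.
    static double symKL(double[] p, double[] q) {
        double kl1 = 0.0, kl2 = 0.0;
        for (int w = 0; w < p.length; w++) {
            kl1 += p[w] * Math.log(p[w] / q[w]);
            kl2 += q[w] * Math.log(q[w] / p[w]);
        }
        return 0.5 * (kl1 + kl2);
    }
    // Heat map cells: low values off the diagonal suggest clusters swapped
    // labels between the two samples.
    static double[][] heatMap(double[][] sampleA, double[][] sampleB) {
        int k = sampleA.length;
        double[][] map = new double[k][k];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < k; j++)
                map[i][j] = symKL(sampleA[i], sampleB[j]);
        return map;
    }
}
```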
Scalability:
Cluster Survey
We’re doing a big cross-corpus, multi-algorithm, multi-metric text clustering survey to answer the following:
Which clustering algorithm has the following properties:
scores high on all/most metrics
works well on many data sets
runs quickly and in reasonable space
Metrics
Metric List
Comparing Metrics
Possible Data Sets
Look at new del.icio.us dataset crawled by Michael
Check PennTags and LibraryThing.com for more labeled data.
Also for more labeled data: Library of Congress Subject Labels
Think about how to use Wikipedia data for clustering and classification
Coding Tasks
XX Debugging EM:
XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
XX Test the assumption that a seeded Random object will produce the same stream.
XX Verify your log distributions everywhere.
XX Verify that documents are getting loaded in the same order.
Initialize model parameters and NOT the posteriors (?)
More efficient datum representation as vectors of counts (see the sketch below).
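A possible shape for that representation, sketched with placeholder names: each document keeps sparse, sorted (word id, count) pairs, so multinomial likelihood evaluation is linear in the number of distinct words rather than in the number of tokens.

```java
import java.util.Map;
import java.util.TreeMap;

// Sparse count-vector datum: distinct word ids plus parallel counts.
public class CountVectorDatum {
    final int[] wordIds;   // sorted distinct word ids
    final int[] counts;    // parallel counts

    CountVectorDatum(int[] tokens) {
        TreeMap<Integer, Integer> m = new TreeMap<>();
        for (int t : tokens) {
            Integer c = m.get(t);
            m.put(t, c == null ? 1 : c + 1);
        }
        wordIds = new int[m.size()];
        counts = new int[m.size()];
        int i = 0;
        for (Map.Entry<Integer, Integer> e : m.entrySet()) {
            wordIds[i] = e.getKey();
            counts[i] = e.getValue();
            i++;
        }
    }

    // Multinomial NB log likelihood: sum over distinct words of count * log P(w|c).
    double logLikelihood(double[] logPwGivenC) {
        double ll = 0.0;
        for (int i = 0; i < wordIds.length; i++)
            ll += counts[i] * logPwGivenC[wordIds[i]];
        return ll;
    }
}
```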
Rewrite code for an explicit E-step / M-step division (95% done; blocked on a bug)
Try something other than Laplace smoothing, such as Good-Turing, to smooth the multinomial NB (see the sketch below).
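If we try Good-Turing, a bare-bones version might look like the following. This is a hedged sketch: real use would want Simple Good-Turing (Gale & Sampson), which smooths the count-of-counts before dividing; here I just fall back to the raw count when N_{r+1} = 0.

```java
// Basic Good-Turing adjusted counts for a single class's word-count vector.
public class GoodTuringSketch {
    // counts[w] = raw count of word w in the class; returns P(w | class).
    static double[] estimate(int[] counts) {
        long total = 0;
        int maxR = 0;
        for (int r : counts) { total += r; maxR = Math.max(maxR, r); }
        long[] nr = new long[maxR + 2];        // nr[r] = #types seen exactly r times
        for (int r : counts) nr[r]++;
        double[] p = new double[counts.length];
        double sum = 0.0;
        for (int w = 0; w < counts.length; w++) {
            int r = counts[w];
            double rStar;
            if (r == 0) {
                // Good-Turing reserves N_1/total mass for all unseen types,
                // shared equally among the N_0 of them.
                rStar = nr[0] > 0 ? (double) nr[1] / nr[0] : 0.0;
            } else if (nr[r + 1] > 0) {
                rStar = (r + 1.0) * nr[r + 1] / nr[r];  // r* = (r+1) N_{r+1} / N_r
            } else {
                rStar = r;                               // crude fallback
            }
            p[w] = rStar / total;
            sum += p[w];
        }
        for (int w = 0; w < p.length; w++) p[w] /= sum;  // renormalize to 1
        return p;
    }
}
```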
Implement K-means
Easy K-means using TF-IDF vectors (see the sketch below).
K-means using “hard” version of EM and multinomial count vectors.
Bisecting K-means
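A sketch of the easy TF-IDF variant, assuming the TF-IDF vectors are computed elsewhere; names are placeholders. This is also a way to look at the "hard EM" item: the assignment step is a hard E-step and the centroid update is the M-step.

```java
import java.util.Random;

// Plain K-means over dense TF-IDF vectors with squared Euclidean distance.
public class KMeansSketch {
    static int[] cluster(double[][] docs, int k, int iters, long seed) {
        Random rand = new Random(seed);
        int n = docs.length, d = docs[0].length;
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++)  // random-doc init (duplicates possible;
            centroids[j] = docs[rand.nextInt(n)].clone();  // k-means++ is better)
        int[] assign = new int[n];
        for (int it = 0; it < iters; it++) {
            // Assignment step: nearest centroid (the hard E-step).
            for (int i = 0; i < n; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (int j = 0; j < k; j++) {
                    double dist = 0.0;
                    for (int t = 0; t < d; t++) {
                        double diff = docs[i][t] - centroids[j][t];
                        dist += diff * diff;
                    }
                    if (dist < best) { best = dist; assign[i] = j; }
                }
            }
            // Update step: mean of assigned docs (the M-step).
            double[][] sums = new double[k][d];
            int[] sizes = new int[k];
            for (int i = 0; i < n; i++) {
                sizes[assign[i]]++;
                for (int t = 0; t < d; t++) sums[assign[i]][t] += docs[i][t];
            }
            for (int j = 0; j < k; j++)
                if (sizes[j] > 0)
                    for (int t = 0; t < d; t++) centroids[j][t] = sums[j][t] / sizes[j];
        }
        return assign;
    }
}
```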
Read about constrained K-means (Wagstaff et al., ICML 2001; also Klein et al., ICML 2002)
Read about fast K-means using the triangle inequality (Elkan, UCSD)
Resources to Investigate
Check out NIPS Workshop on “Learning to Compare Examples”
Read the Meila & Heckerman paper (link from the CS 598R reading list).
Estimate log with a Taylor-series expansion for quick evaluation (see the sketch at the end of this list) – ask Hal
Read the Russell & Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
Contact Teg to get access to his EM + Gibbs Sampler code as another clusterer to try on our data.
Read Fast Clustering Papers and supplement bibliography
Read Bekkerman & McCallum paper from 2005 ICML on co-clustering
Get Griffiths & Steyvers' LDA code – uses Gibbs sampling
Read Radford Neal's 1998 technical report "Markov chain sampling methods for Dirichlet process mixture models"
Investigate Multi-view Clustering (like Multi-view Classification) – ref.: Sanjoy Dasgupta and Inderjit Dhillon
"Product of Experts" – ref.: Geoff Hinton
Look into prior work on using graph cuts and flows to cluster documents
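On the fast-log idea above, here's the sort of thing I have in mind, as an unvetted sketch: range-reduce to the mantissa, then use a short Mercator/Taylor series. The error grows as the mantissa approaches 2, so a real version would add terms or tighten the range reduction; benchmark against Math.log before trusting it anywhere.

```java
// Quick log approximation: x = m * 2^e with m in [1,2), so
// ln(x) = e*ln(2) + ln(1+t) where t = m-1, and
// ln(1+t) ~= t - t^2/2 + t^3/3 - t^4/4.
public class FastLog {
    static final double LN2 = 0.6931471805599453;
    static double fastLog(double x) {
        int e = Math.getExponent(x);          // unbiased binary exponent
        double m = x / Math.scalb(1.0, e);    // mantissa in [1, 2)
        double t = m - 1.0;
        double series = t - t * t / 2 + t * t * t / 3 - t * t * t * t / 4;
        return e * LN2 + series;              // valid for normal positive x
    }
}
```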
Ideas to Explore
Propose an alternate version of EM, inspired by Gauss-Seidel: every time you change a P(w|c), update the model parameters immediately (see the sketch below). For 611: prove convergence. Propose how to parallelize.
Try Cheeseman & Stutz's approximation to the marginal log likelihood of the data (see the note below).
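A sketch of the Gauss-Seidel-style EM, on the assumption that it amounts to incremental EM in the Neal & Hinton sense: hold the parameters implicitly in running sufficient statistics, retract a document's old responsibilities, recompute them against the current statistics, and fold them back in, so every subsequent document sees refreshed parameters. The NB-mixture setup and all names are placeholders; docs are dense count vectors over the vocabulary.

```java
import java.util.Random;

public class IncrementalEMSketch {
    final int k, vocab;
    final double alpha = 1.0;    // Laplace pseudo-count, keeps the logs finite
    final double[][] wordStats;  // expected count of word w in cluster c
    final double[] tokenMass;    // expected total tokens per cluster
    final double[] docMass;      // expected number of docs per cluster
    final double[][] resp;       // responsibilities, doc x cluster

    IncrementalEMSketch(int k, int vocab, int nDocs) {
        this.k = k; this.vocab = vocab;
        wordStats = new double[k][vocab];
        tokenMass = new double[k];
        docMass = new double[k];
        resp = new double[nDocs][k];
    }

    // Seed with random normalized responsibilities and build the stats once.
    void init(int[][] docs, Random rand) {
        for (int i = 0; i < docs.length; i++) {
            double z = 0.0;
            for (int c = 0; c < k; c++) { resp[i][c] = rand.nextDouble(); z += resp[i][c]; }
            for (int c = 0; c < k; c++) resp[i][c] /= z;
            addToStats(docs[i], resp[i], +1.0);
        }
    }

    // One Gauss-Seidel-style sweep: the parameters live in the sufficient
    // statistics, so folding each document back in is itself a mini M-step.
    void sweep(int[][] docs) {
        for (int i = 0; i < docs.length; i++) {
            addToStats(docs[i], resp[i], -1.0);  // retract old soft assignment
            posterior(docs[i], resp[i]);         // per-doc E-step, fresh params
            addToStats(docs[i], resp[i], +1.0);  // fold new assignment back in
        }
    }

    void addToStats(int[] doc, double[] r, double sign) {
        for (int c = 0; c < k; c++) {
            docMass[c] += sign * r[c];
            for (int w = 0; w < vocab; w++) {
                wordStats[c][w] += sign * r[c] * doc[w];
                tokenMass[c] += sign * r[c] * doc[w];
            }
        }
    }

    // P(c | doc) with smoothed parameters read directly off the current stats.
    void posterior(int[] doc, double[] out) {
        double totalDocs = 0.0;
        for (int c = 0; c < k; c++) totalDocs += docMass[c];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < k; c++) {
            double lp = Math.log((docMass[c] + alpha) / (totalDocs + k * alpha));
            for (int w = 0; w < vocab; w++)
                if (doc[w] > 0)
                    lp += doc[w] * Math.log((wordStats[c][w] + alpha)
                                            / (tokenMass[c] + vocab * alpha));
            out[c] = lp;
            if (lp > max) max = lp;
        }
        double z = 0.0;
        for (int c = 0; c < k; c++) { out[c] = Math.exp(out[c] - max); z += out[c]; }
        for (int c = 0; c < k; c++) out[c] /= z;
    }
}
```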
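For the Cheeseman & Stutz item, my recollection of the approximation (worth double-checking against Chickering & Heckerman's comparison paper before relying on it): complete the data D into D~ using the expected sufficient statistics under the EM estimate theta-hat, then

```latex
% Cheeseman-Stutz approximation: \tilde{D} is the data completed with
% expected sufficient statistics under the EM estimate \hat{\theta}.
\log p(D \mid M) \;\approx\;
    \log p(\tilde{D} \mid M)
    + \log p(D \mid \hat{\theta}, M)
    - \log p(\tilde{D} \mid \hat{\theta}, M)
```

The first term has a closed form for conjugate mixtures, and the last two come straight out of EM.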
Newsreader