= Papers =

* 1. Gibbs Sampling on Mixture of Multinomials (see the sampler sketch after this list)
** EM alone
** CEM
** k-means (not model-based)
** GS (Gibbs sampling)
** GS + EM starting from the MAP-likelihood sample
** Other things to try:
**# send email queries to figure out how to normalize the likelihood and to get Minimum Bayes Risk worked out, so that we can throw in another summary technique.
**# don't plot all metrics on the same scale; perhaps differences are more visible when zoomed in.
**# work on the test for significance (related to #2)
**# see to what degree these results hold up on the other data sets
**# see to what degree these results hold up on smaller splits (approx. 5,000 docs max.), possibly with longer chains
**# validate with the HBC implementation of Gibbs sampling

Worth doing, perhaps for a subsequent paper:
6. compare with Gibbs + EM
7. compare with Variational EM using HBC
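As a reference point for item 1, below is a minimal sketch of an (uncollapsed) Gibbs sampler for the mixture-of-multinomials document model. It is not the project's actual code: the function name, the (docs x vocab) count-matrix layout, and the hyperparameters (alpha, beta, iters) are illustrative assumptions.

<code python>
# Minimal Gibbs sampler sketch for a mixture of multinomials
# (one latent cluster label per document).  Illustrative only.
import numpy as np

def gibbs_mixture_of_multinomials(X, K, iters=200, alpha=1.0, beta=0.01, seed=0):
    """X: (n_docs, vocab) term-count matrix.  Returns the chain of label vectors."""
    rng = np.random.default_rng(seed)
    D, V = X.shape
    z = rng.integers(K, size=D)                  # random initial cluster labels
    chain = []
    for _ in range(iters):
        # sample mixture weights and per-cluster word distributions given z
        n_k = np.bincount(z, minlength=K)
        pi = rng.dirichlet(alpha + n_k)
        phi = np.vstack([rng.dirichlet(beta + X[z == k].sum(axis=0))
                         for k in range(K)])     # (K, V) word distributions
        # resample every document's label given pi and phi, in the log domain
        log_phi = np.log(np.maximum(phi, 1e-300))        # floor avoids log(0)
        log_p = np.log(pi) + X @ log_phi.T               # (D, K) unnormalized
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=row) for row in p])
        chain.append(z.copy())
    return chain
</code>

A collapsed sampler (integrating out pi and phi) usually mixes better; the explicit version above is just the easiest to read.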
<br>
* 2. Metric Survey, Algorithm Survey (see below)
** Comparing clustering metrics and algorithms
** Correlations among metrics
** Across many data sets
<br>
* 3. Publicizing del.icio.us as new data set
** web data
** human labeled
** plentiful
<br>
* 4. Pick best/fastest clustering algorithm
** experiments on desktop PC data
** experiments on "virtual user"
*** 1 Enron user's email
*** 20 Newsgroups specific to that user's interests in the timeframe of the Enron corpus (harvest Google Groups or a Usenet corpus)
*** Del.icio.us - use the user's interests to crawl web data
*** IM? PDFs? .doc, .ppt, .xls?
== Current Experiments ==

* Reuters Reduced: the goal is to assess clustering metrics in comparison to results from super-reduced
* XX On Reuters super-reduced: find the mode of the joint posterior and compare it with the modes of the per-document marginal posteriors
** In one run consisting of 11,000 samples there were no duplicate samples from the joint.
* Pick k using EM and a held-out data set.
* Implement Andrew Rosenberg's (?) V metric based on his EMNLP paper (see the sketch after this list).
* Implement BIC?
* Summarize clusters from Gibbs sampling with the following techniques (use error bars to reflect confidence intervals around metrics):
** modes of per-document marginal posteriors
** last sample
** MAP sample
** random sample
* Identifiability / Label-Switching Drift Map
** diagnose using a cluster-vs.-cluster KL-divergence heat map
** see Steyvers & Griffiths 2006, Fig. 8, for the idea
** try at different lags within the '''same''' run
* Scalability:
** Reuters Full: can we scale?
** Reuters All: can we really scale?
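For the "Implement ... V metric" item above, here is a minimal sketch of the V-measure (Rosenberg & Hirschberg, EMNLP 2007): the harmonic mean of homogeneity and completeness, both defined via conditional entropies of the class/cluster contingency table. The function and variable names are illustrative, not tied to our code.

<code python>
# Sketch of the V-measure from a class/cluster contingency table.
import numpy as np

def v_measure(labels_true, labels_pred):
    classes, y = np.unique(labels_true, return_inverse=True)
    clusters, k = np.unique(labels_pred, return_inverse=True)
    n = np.zeros((len(classes), len(clusters)))
    for i, j in zip(y, k):
        n[i, j] += 1
    N = n.sum()
    def H(counts):                       # entropy of a marginal distribution
        p = counts[counts > 0] / N
        return -(p * np.log(p)).sum()
    H_C, H_K = H(n.sum(axis=1)), H(n.sum(axis=0))
    rows, cols = np.nonzero(n)
    joint = n[rows, cols] / N
    # conditional entropies H(C|K) and H(K|C)
    H_C_given_K = -(joint * np.log(n[rows, cols] / n.sum(axis=0)[cols])).sum()
    H_K_given_C = -(joint * np.log(n[rows, cols] / n.sum(axis=1)[rows])).sum()
    h = 1.0 if H_C == 0 else 1.0 - H_C_given_K / H_C   # homogeneity
    c = 1.0 if H_K == 0 else 1.0 - H_K_given_C / H_K   # completeness
    return 0.0 if h + c == 0 else 2 * h * c / (h + c)
</code>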
= Cluster Survey =

We're doing a big cross-corpus, multi-algorithm, multi-metric text clustering survey to answer the following:
''Which clustering algorithm has the following properties:
* scores high on all/most metrics
* works well on many data sets
* runs quickly and in reasonable space''

== Metrics ==

=== Metric List ===
* Enumerate and describe all known clustering metrics; in particular, look into:
** Cluster edit distance
** Compactness
** BIC?
** V-metric
** Q0, Q2
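If BIC makes the list, one common form (Schwarz's criterion, written here so that larger is better) needs only the model's log likelihood, its number of free parameters, and the number of documents. The parameter count shown for a mixture of multinomials is an illustrative assumption about how we would count them.

<code python>
import math

def bic(log_likelihood, n_free_params, n_docs):
    """Schwarz's Bayesian Information Criterion (larger is better in this form)."""
    return log_likelihood - 0.5 * n_free_params * math.log(n_docs)

# e.g., for a K-cluster mixture of multinomials over a V-word vocabulary:
#   n_free_params = (K - 1) + K * (V - 1)
</code>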
=== Comparing Metrics ===
* Multidimensional scaling
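One way to do this is classical (Torgerson) MDS on a metric-by-metric dissimilarity matrix built, for example, from correlations of metric scores across many clustering runs. The sketch below is under that assumption; using 1 - |correlation| as the dissimilarity and a dense eigendecomposition are illustrative choices.

<code python>
# Classical MDS sketch: embed each metric as a 2-D point from a
# metrics-by-runs score matrix.  Illustrative only.
import numpy as np

def mds_embed(scores, dim=2):
    """scores: (n_metrics, n_runs) array of metric values per clustering run."""
    D = 1.0 - np.abs(np.corrcoef(scores))        # pairwise dissimilarities
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]      # keep the top eigenpairs
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
</code>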
== Possible Data Sets ==
* Look at the new del.icio.us dataset crawled by Michael
* Check PennTags and LibraryThing.com for more labeled data.
* Also for more labeled data: Library of Congress Subject Headings
* 20 Newsgroups at \\entropy\data
* Think about how to use Wikipedia data for clustering and classification
== Coding Tasks ==

* XX Debugging EM:
** XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
** XX Test the assumption that a seeded Random object will produce the same stream.
** XX Verify your log distributions everywhere.
** XX Verify that documents are getting loaded in the same order.
** Initialize model parameters and NOT posteriors (?)
* More efficient datum representation as vectors of counts.
** Write Vector class (done 9/22)
** Replace Labeled Datums with vectors in the clusterer tester and clusterers
* Rewrite code for explicit E-step/M-step division (95%; blocked by a bug)
* Try something other than Laplace smoothing, such as Good-Turing, to smooth the multinomial NB model.
* Implement K-means (see the sketch below)
** Easy K-means using TF-IDF vectors
** K-means using the "hard" version of EM and multinomial count vectors
** Bisecting K-means
** Read about constrained K-means (Wagstaff & Klein)
** Read about fast K-means using the triangle inequality (UCSD)
<br>
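"Easy K-means using TF-IDF vectors" could look roughly like the sketch below (Lloyd's algorithm on a dense docs-by-terms matrix). The TF-IDF weighting, the initialization, and the convergence test are illustrative choices, not the plan of record.

<code python>
import numpy as np

def tfidf(counts):
    """counts: (n_docs, n_terms) raw term counts -> TF-IDF weights."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    idf = np.log(counts.shape[0] / np.maximum((counts > 0).sum(axis=0), 1))
    return tf * idf

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(X.shape[0], K, replace=False)]   # random init
    for _ in range(iters):
        # assign each document to its nearest center (squared Euclidean)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute centers, keeping the old one if a cluster goes empty
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
</code>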
* Assimilate all of Michael Goulding's changes in an appropriate way
* From Michael:
** Adjusted Rand index = -28752.55 / 1508641.19 = -0.019059
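For reference, a minimal sketch of the Adjusted Rand Index behind the number above: (index - expected index) / (max index - expected index), computed from the class/cluster contingency table. Names are illustrative.

<code python>
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    classes, y = np.unique(labels_true, return_inverse=True)
    clusters, c = np.unique(labels_pred, return_inverse=True)
    n = np.zeros((len(classes), len(clusters)))
    for i, j in zip(y, c):
        n[i, j] += 1
    comb2 = lambda x: x * (x - 1) / 2.0          # "choose 2" on counts
    sum_ij = comb2(n).sum()
    sum_a = comb2(n.sum(axis=1)).sum()
    sum_b = comb2(n.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n.sum())
    max_index = 0.5 * (sum_a + sum_b)
    denom = max_index - expected
    # degenerate case (both partitions trivial) conventionally scores 1.0
    return 1.0 if denom == 0 else (sum_ij - expected) / denom
</code>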
== Resources to Investigate ==
* Check out the NIPS Workshop on "Learning to Compare Examples"
* Read the Meila and Heckerman paper (link from the CS 598R reading list).
* Estimate log with a Taylor-series expansion for quick evaluation -- ask Hal
** Augment the wiki page on "Log domain calculations" with this info (see the log-sum-exp sketch after this list).
* Read the Russell and Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
* Contact Teg to get access to his EM + Gibbs sampler code as another clusterer to try on our data.
* Read Fast Clustering Papers and supplement the bibliography
* Read the Bekkerman & McCallum paper from ICML 2005 on co-clustering
* Get the Griffiths & Steyvers LDA code -- using Gibbs sampling
* Read Radford Neal's technical report "Markov chain sampling methods for Dirichlet process mixture models" from 1998
* Investigate multi-view clustering (like multi-view classification) -- ref.: Sanjoy Dasgupta and Inderjit Dhillon
* "Product of Experts" -- ref.: Geoff Hinton
* Look into prior work on using graph cuts and flows to cluster documents
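Related to the log-domain items above, this is the standard log-sum-exp trick for summing probabilities stored as logs without underflow; it is the baseline any Taylor-series shortcut would be compared against. A minimal sketch:

<code python>
import numpy as np

def log_sum_exp(log_values):
    """log(sum_i exp(log_values[i])) computed stably."""
    m = np.max(log_values)
    if np.isneginf(m):
        return m                      # all inputs are log(0)
    return m + np.log(np.sum(np.exp(np.asarray(log_values) - m)))

# e.g. normalizing a log-posterior over clusters:
#   log_post = log_joint - log_sum_exp(log_joint)
</code>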
== Ideas to Explore ==
* Propose an alternate version of EM, inspired by Gauss-Seidel: every time you change a P(w|c), update the model parameters. For 611: prove convergence. Propose how to parallelize. (See the sketch after this list.)
* Cheeseman and Stutz's approximation to the marginal log likelihood of the data
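One possible reading of the Gauss-Seidel-flavored EM idea above, sketched in the spirit of incremental EM (Neal & Hinton): recompute one document's responsibilities at a time and immediately fold the change into the running sufficient statistics, so later documents see updated parameters. This is only an illustration of the idea; the function name, the statistics layout, and the smoothing constant are assumptions, not the proposed algorithm itself.

<code python>
import numpy as np

def incremental_em_sweep(X, resp, prior_counts=0.01):
    """X: (D, V) term counts; resp: (D, K) current responsibilities (modified in place)."""
    D, V = X.shape
    K = resp.shape[1]
    word_stats = prior_counts + resp.T @ X          # (K, V) expected word counts
    cluster_stats = prior_counts + resp.sum(axis=0) # (K,) expected cluster sizes
    for d in range(D):
        # E-step for one document using the *current* running statistics
        log_p = np.log(cluster_stats / cluster_stats.sum())
        log_p += X[d] @ np.log(word_stats / word_stats.sum(axis=1, keepdims=True)).T
        new_r = np.exp(log_p - log_p.max())
        new_r /= new_r.sum()
        # immediately swap this document's old contribution for the new one
        word_stats += np.outer(new_r - resp[d], X[d])
        cluster_stats += new_r - resp[d]
        resp[d] = new_r
    return resp, word_stats, cluster_stats
</code>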
= Newsreader =
* Select the best clustering algorithm for keyphrase extraction from news
  