nlp-private:document-clustering-and-metrics (2015/04/22 14:58, created by ryancha)

= Papers =

* 1. Gibbs Sampling on Mixture of Multinomials
** EM alone
** CEM
** k-means (not model based)
** GS
** GS + EM starting from the MAP-likelihood sample
** Other things to try:
**# send email queries to figure out how to normalize the likelihood and to get Minimum Bayes Risk worked out so that we can throw in another summary technique
**# don't plot all metrics on the same scale; perhaps differences are more visible when zoomed in
**# work on the test for significance (related to #2)
**# see to what degree these results hold up on the other data sets

**# see to what degree these results hold up on smaller splits (approx. 5,000 docs max.), possibly with longer chains

**# validate with the HBC implementation of Gibbs sampling

Worth doing, perhaps for a subsequent paper:
6. compare with Gibbs + EM
7. compare with Variational EM using HBC
<br>
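For reference, the model in item 1 can be sketched as a collapsed Gibbs sampler over a mixture of multinomials. This is a toy illustration, not the project's code: the function name, the symmetric Dirichlet hyperparameters alpha and beta, and the iteration count are all assumptions.

```python
import math
import random

def gibbs_mixture_multinomials(docs, K, vocab_size, alpha=1.0, beta=0.1,
                               iters=100, seed=0):
    """docs: list of word-id lists. Returns a sample of cluster assignments."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]         # cluster of each doc
    n_k = [0] * K                                # docs per cluster
    n_kw = [[0] * vocab_size for _ in range(K)]  # word counts per cluster
    n_kt = [0] * K                               # total tokens per cluster
    for d, doc in enumerate(docs):
        n_k[z[d]] += 1
        n_kt[z[d]] += len(doc)
        for w in doc:
            n_kw[z[d]][w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            k = z[d]
            # remove doc d from the count tables
            n_k[k] -= 1
            n_kt[k] -= len(doc)
            for w in doc:
                n_kw[k][w] -= 1
            # collapsed predictive log probability of each cluster for doc d
            logp = []
            for j in range(K):
                lp = math.log(n_k[j] + alpha)
                for i, w in enumerate(doc):
                    # doc[:i].count(w) accounts for repeats of w earlier in doc
                    lp += math.log(n_kw[j][w] + beta + doc[:i].count(w))
                    lp -= math.log(n_kt[j] + beta * vocab_size + i)
                logp.append(lp)
            # sample the new cluster, normalizing in the log domain
            m = max(logp)
            probs = [math.exp(lp - m) for lp in logp]
            r = rng.random() * sum(probs)
            k = K - 1
            acc = 0.0
            for j in range(K):
                acc += probs[j]
                if r <= acc:
                    k = j
                    break
            z[d] = k
            n_k[k] += 1
            n_kt[k] += len(doc)
            for w in doc:
                n_kw[k][w] += 1
    return z
```

On well-separated data (two groups of docs over disjoint vocabularies) a run like this recovers the two groups; the GS + EM variant would hand the MAP-likelihood sample from this chain to EM as its starting point.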

* 2. Metric Survey, Algorithm Survey (see below)
** Comparing clustering metrics and algorithms
** Correlations among metrics
** Across many data sets
<br>
* 3. Publicizing del.icio.us as a new data set
** web data
** human labeled
** plentiful
<br>
* 4. Pick the best/fastest clustering algorithm
** experiments on desktop PC data
** experiments on a "virtual user":
*** 1 Enron user's email
*** 20 Newsgroups specific to that user's interests in the timeframe of the Enron corpus (harvest Google Groups or a Usenet corpus)
*** Del.icio.us: use the user's interests to crawl web data
*** IM? PDFs? .doc, .ppt, .xls?

== Current Experiments ==

* Reuters Reduced: goal is to assess clustering metrics in comparison to results from super-reduced
* XX On Reuters super-reduced: find the mode of the joint posterior; compare with the modes of the per-document marginal posteriors
** In one run consisting of 11,000 samples there were no duplicate samples from the joint.
* Pick k using EM and a held-out data set.
* Implement Andrew Rosenberg's (?) V metric based on his EMNLP paper.
* Implement BIC?
* Summarize clusters from Gibbs sampling with the following techniques (use error bars to reflect confidence intervals around metrics):
** modes of the per-document marginal posteriors
** last sample
** MAP sample
** random sample
* Identifiability / label-switching drift map
** diagnose using a cluster-vs.-cluster KL-divergence heat map
** see Steyvers & Griffiths 2006, Fig. 8 for the idea
** try at different lags within the '''same''' run.
* Scalability:
** Reuters Full: can we scale?
** Reuters All: can we really scale?
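The KL-divergence heat map proposed above for diagnosing label switching can be sketched as follows. The interface is assumed: each cluster is represented by its word distribution, and the two distribution lists come from two lags of the same chain.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_heat_map(dists_a, dists_b):
    """K x K matrix of KL divergences between cluster word distributions
    taken at two lags; entry [i][j] compares cluster i at lag a with
    cluster j at lag b."""
    return [[kl(p, q) for q in dists_b] for p in dists_a]
```

If labels are stable, the near-zero entries sit on the diagonal; if the chain has switched labels between the two lags, the near-zero entries move off the diagonal.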

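The cluster-summary rules listed under Current Experiments (last sample, MAP sample, per-document marginal modes) can be sketched in a few lines. The stored-sample interface here is an assumption: each Gibbs sample is a pair of an assignment vector and its log joint probability.

```python
from collections import Counter

def summarize_samples(samples):
    """samples: list of (assignments, log_joint) pairs from one Gibbs run."""
    last = samples[-1][0]                        # last sample
    map_sample = max(samples, key=lambda s: s[1])[0]  # highest log joint
    n_docs = len(last)
    # per-document marginal mode: most frequent label per doc across samples
    marginal_modes = [
        Counter(z[d] for z, _ in samples).most_common(1)[0][0]
        for d in range(n_docs)
    ]
    return {"last": last, "MAP": map_sample, "marginal_modes": marginal_modes}
```

A "random sample" summary is just a uniform draw from the same list, so it is omitted here.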

= Cluster Survey =
We're doing a big cross-corpus, multi-algorithm, multi-metric text clustering survey to answer the following:
''Which clustering algorithm has the following properties:
* scores high on all/most metrics
* works well on many data sets
* runs quickly and in reasonable space''

== Metrics ==

=== Metric List ===
* Enumerate and describe all known clustering metrics; in particular, look into:
** Cluster edit distance
** Compactness
** BIC?
** V-metric
** Q0, Q2
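The V-metric in the list above appears to be Rosenberg & Hirschberg's V-measure: the harmonic mean of homogeneity and completeness, both defined from conditional entropies of the true classes C and the induced clusters K. A sketch under that assumption:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a label list, in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(xs, ys):
    """Conditional entropy H(X | Y) from paired label lists."""
    n = len(xs)
    h = 0.0
    for y, cnt in Counter(ys).items():
        sub = [x for x, yy in zip(xs, ys) if yy == y]
        h += cnt / n * entropy(sub)
    return h

def v_measure(classes, clusters):
    hc, hk = entropy(classes), entropy(clusters)
    homog = 1.0 if hc == 0 else 1 - cond_entropy(classes, clusters) / hc
    compl = 1.0 if hk == 0 else 1 - cond_entropy(clusters, classes) / hk
    return 0.0 if homog + compl == 0 else 2 * homog * compl / (homog + compl)
```

A perfect clustering (even with permuted labels) scores 1; a clustering that splits every class evenly across clusters scores 0.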

=== Comparing Metrics ===
* Multidimensional scaling

== Possible Data Sets ==
* Look at the new del.icio.us dataset crawled by Michael
* Check PennTags and LibraryThing.com for more labeled data.
* Also for more labeled data: Library of Congress Subject Headings
* 20 Newsgroups at \\entropy\data
* Think about how to use Wikipedia data for clustering and classification

== Coding Tasks ==

* XX Debugging EM:
** XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
** XX Test the assumption that a seeded Random object will produce the same stream.
** XX Verify your log distributions everywhere.
** XX Verify that documents are getting loaded in the same order.
** Initialize model parameters and NOT posteriors (?)
* More efficient datum representation as vectors of counts.
** Write Vector class (done 9/22)
** Replace Labeled Datums with vectors in the clusterer tester and clusterers
* Rewrite code for an explicit E-step/M-step division (95% done, blocked by a bug)
* Try something other than Laplace smoothing, such as Good-Turing, to smooth the multinomial NB.
* Implement K-means
** Easy K-means using TF-IDF vectors.
** K-means using a "hard" version of EM and multinomial count vectors.
** Bisecting K-means
** Read about constrained K-means (Wagstaff & Klein)
** Read about fast K-means using the triangle inequality (UCSD)

<br>
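The "hard EM over multinomial count vectors" variant of K-means above can be sketched as follows. Assumptions: documents arrive as word-count vectors, clusters are seeded from the first K documents, and an add-0.1 smoothing constant stands in for whatever smoothing the real code uses.

```python
import math

def hard_em_multinomial(docs, K, iters=20, smooth=0.1):
    """docs: list of word-count vectors, all of length V."""
    V = len(docs[0])
    # seed each cluster's word distribution with one document (assumed init)
    dists = [[(c + smooth) / (sum(doc) + smooth * V) for c in doc]
             for doc in docs[:K]]
    labels = [0] * len(docs)
    for _ in range(iters):
        # hard E-step: assign each doc its maximum-likelihood cluster
        for d, doc in enumerate(docs):
            labels[d] = max(range(K),
                            key=lambda k: sum(n * math.log(dists[k][w])
                                              for w, n in enumerate(doc) if n))
        # M-step: re-estimate smoothed cluster word distributions
        dists = []
        for k in range(K):
            counts = [smooth] * V
            for d, doc in enumerate(docs):
                if labels[d] == k:
                    for w in range(V):
                        counts[w] += doc[w]
            total = sum(counts)
            dists.append([c / total for c in counts])
    return labels
```

This is exactly EM on a mixture of multinomials with the posterior replaced by a point mass (CEM), which is why it belongs in the K-means family.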

* Assimilate all of Michael Goulding's changes in an appropriate way
* From Michael:
** Adjusted Rand index = -28752.55 / 1508641.19 = -0.019059
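As a sanity check on numbers like the one above, the Adjusted Rand index can be computed directly from pair counts over the contingency table. A minimal sketch (the list-of-labels interface is an assumption):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """ARI between two labelings of the same items, given as label lists."""
    n = len(a)
    ab = Counter(zip(a, b))  # contingency-table cell counts
    sum_ij = sum(comb(c, 2) for c in ab.values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 0.0
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions score 1; chance-level agreement scores near 0, and worse-than-chance agreement (as in the number quoted above) goes negative.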

== Resources to Investigate ==
* Check out the NIPS Workshop on "Learning to Compare Examples"
* Read the Meila and Heckerman paper (link from the CS 598R reading list).
* Estimate log with a Taylor-series expansion for quick evaluation -- ask Hal
** Augment the wiki page on "Log domain calculations" with this info.
* Read the Russell and Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
* Contact Teg to get access to his EM + Gibbs sampler code as another clusterer to try on our data.
* Read the fast clustering papers and supplement the bibliography
* Read the Bekkerman & McCallum paper from 2005 ICML on co-clustering
* Get the Griffiths & Steyvers LDA code -- using Gibbs sampling
* Read Radford Neal's technical report "Markov chain sampling methods for Dirichlet process mixture models" from 1998
* Investigate multi-view clustering (like multi-view classification) -- ref.: Sanjoy Dasgupta and Inderjit Dhillon
* "Product of Experts" -- ref.: Geoff Hinton
* Look into prior work on using graph cuts and flows to cluster documents
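The "Log domain calculations" page mentioned above would center on log-domain addition: computing log(exp(a) + exp(b)) without underflow by factoring out the larger term. A standard sketch:

```python
import math

def log_add(a, b):
    """log(exp(a) + exp(b)), stable even when a and b are very negative."""
    if a < b:
        a, b = b, a                      # ensure a is the larger term
    return a + math.log1p(math.exp(b - a))
```

Naively exponentiating a log likelihood like -1000 underflows to 0.0; this form only ever exponentiates the non-positive difference b - a.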

== Ideas to Explore ==
* Propose an alternate version of EM, inspired by Gauss-Seidel: every time you change a P(w|c), update the model parameters. For 611: prove convergence. Propose how to parallelize.
* Cheeseman and Stutz's approximation to the marginal log likelihood of the data

= Newsreader =
* select the best clustering algorithm for keyphrase extraction from news