Papers

  • 1. Gibbs Sampling on Mixture of Multinomials
    • EM alone
    • CEM
    • k-means (not model based)
    • GS (Gibbs sampling; sketched below)
    • GS + EM, with EM initialized from the highest-likelihood (MAP) sample
    • Other things to try:
      • send email queries to figure out how to normalize the likelihood and to work out Minimum Bayes Risk decoding, so we can add it as another summary technique.
      • don't plot all metrics on the same scale; differences may be more visible when zoomed in.
      • work on the test for significance (related to #2)
      • see to what degree these results hold up on the other data sets
      • see to what degree these results hold up on smaller splits (approx. 5,000 docs max.), possibly with longer chains
      • validate with HBC implementation of Gibbs sampling
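
A minimal sketch of the collapsed Gibbs sampler for the mixture of multinomials, assuming symmetric Dirichlet priors with made-up hyperparameters alpha and beta. For speed it scores each document against the current smoothed cluster distributions rather than computing the exact gamma-function ratios, so treat it as a starting point rather than the experiment code:

<code python>
import numpy as np

def gibbs_mixture_multinomial(counts, K, alpha=1.0, beta=0.1, n_sweeps=1000, seed=0):
    """Collapsed Gibbs sampler for a mixture of multinomials.

    counts: (D, V) matrix of per-document word counts.
    Returns the chain of assignment vectors, one per sweep.
    """
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    z = rng.integers(K, size=D)                      # random initialization
    n_k = np.bincount(z, minlength=K).astype(float)  # docs per cluster
    n_kw = np.zeros((K, V))                          # word counts per cluster
    for d in range(D):
        n_kw[z[d]] += counts[d]
    chain = []
    for _ in range(n_sweeps):
        for d in range(D):
            # remove doc d from the statistics
            n_k[z[d]] -= 1
            n_kw[z[d]] -= counts[d]
            # p(z_d = k | z_-d, w) up to a constant; scoring the doc against
            # the smoothed cluster distributions approximates the exact
            # gamma-ratio term (tokens within d are treated as independent)
            log_p = np.log(n_k + alpha)
            for k in range(K):
                theta_k = (n_kw[k] + beta) / (n_kw[k].sum() + V * beta)
                log_p[k] += counts[d] @ np.log(theta_k)
            p = np.exp(log_p - log_p.max())
            z[d] = rng.choice(K, p=p / p.sum())
            # add doc d back under its new assignment
            n_k[z[d]] += 1
            n_kw[z[d]] += counts[d]
        chain.append(z.copy())
    return chain
</code>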

Worth doing, perhaps for a subsequent paper: 6. compare with Gibbs + EM; 7. compare with variational EM using HBC.

  • 2. Metric Survey, Algorithm survey (see below)
    • Comparing clustering metrics and algorithms
    • Correlations among metrics
    • Across many data sets


  • 3. Publicizing del.icio.us as a new data set
    • web data
    • human labeled
    • plentiful


  • 4. Pick best/fastest clustering algorithm
    • experiments on desktop PC data
    • experiments on “virtual user”
      • 1 Enron user's email
      • 20 Newsgroups posts specific to that user's interests in the timeframe of the Enron corpus (harvest Google Groups or a Usenet corpus)
      • del.icio.us: use the user's interests to crawl web data
      • IM? PDFs? .doc, .ppt, .xls?

Current Experiments

  • Reuters Reduced: the goal is to assess clustering metrics against the results from super-reduced
  • XX On Reuters super-reduced: find mode of joint posterior, compare with modes of per-document marginal posteriors
    • In one run consisting of 11,000 samples there were no duplicate samples from the joint.
  • Pick k using EM and a held-out data set.
  • Implement Andrew Rosenberg's V-measure, from his EMNLP 2007 paper with Julia Hirschberg.
  • Implement BIC?
  • Summarize clusters from Gibbs sampling with the following techniques (use error bars to reflect confidence intervals around the metrics; see the sketch after this list):
    • modes of per-document marginal posteriors
    • last sample
    • MAP sample
    • random sample
  • Identifiability / Label-Switching Drift Map
    • diagnose using a cluster-vs-cluster KL-divergence heat map (also covered in the sketch below)
    • see Steyvers & Griffiths 2006 Fig. 8 for idea
    • try at different lags within the same run.
  • Scalability:
    • Reuters Full: can we scale?
    • Reuters All: can we really scale?
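
A rough sketch of the four chain-summarization techniques listed above, plus the cluster-vs-cluster KL heat map for diagnosing label switching; the chain, the log_joint function, and the per-cluster count matrices are assumed to come from the sampler:

<code python>
import numpy as np

def summarize_chain(chain, log_joint, seed=0):
    """Collapse a Gibbs chain (list of (D,) assignment vectors, post burn-in)
    into a single clustering, four ways. log_joint maps an assignment
    vector to its log joint probability under the model."""
    rng = np.random.default_rng(seed)
    samples = np.stack(chain)                        # (S, D)
    S, D = samples.shape
    K = samples.max() + 1
    # modes of the per-document marginal posteriors; only meaningful if the
    # chain has not label-switched (check the heat map below first)
    marginal = np.array([np.bincount(samples[:, d], minlength=K).argmax()
                         for d in range(D)])
    return dict(marginal=marginal,
                last=chain[-1],
                map=max(chain, key=log_joint),       # best visited state
                random=chain[rng.integers(S)])

def kl_heatmap(n_kw_a, n_kw_b, beta=0.1):
    """Cluster-vs-cluster KL divergences between two sets of smoothed word
    distributions, e.g. the same run at two lags. If no row has a single
    dominant low-KL match, the labels have probably drifted."""
    pa = (n_kw_a + beta) / (n_kw_a + beta).sum(axis=1, keepdims=True)
    pb = (n_kw_b + beta) / (n_kw_b + beta).sum(axis=1, keepdims=True)
    return np.array([[np.sum(p * np.log(p / q)) for q in pb] for p in pa])
</code>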

Cluster Survey

We’re doing a big cross-corpus, multi-algorithm, multi-metric text clustering survey to answer one question: which clustering algorithm has the following properties?

  • scores high on all/most metrics
  • works well on many data sets
  • runs quickly and in reasonable space

Metrics

Metric List

  • Enumerate and describe all known clustering metrics; in particular, look into
    • Cluster edit distance
    • Compactness
    • BIC?
    • V-measure (sketched below)
    • Q0, Q2
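
For reference, a small sketch of the V-measure computed from the contingency table, following Rosenberg & Hirschberg's homogeneity/completeness definitions; the eps guard against log(0) is our own hack:

<code python>
import numpy as np

def v_measure(labels_true, labels_pred, eps=1e-12):
    """V-measure: harmonic mean of homogeneity and completeness."""
    _, c_idx = np.unique(labels_true, return_inverse=True)
    _, k_idx = np.unique(labels_pred, return_inverse=True)
    A = np.zeros((c_idx.max() + 1, k_idx.max() + 1))
    for c, k in zip(c_idx, k_idx):
        A[c, k] += 1                                   # contingency table
    p_ck = A / A.sum()
    p_c, p_k = p_ck.sum(axis=1), p_ck.sum(axis=0)
    H_c = -np.sum(p_c * np.log(p_c + eps))             # class entropy H(C)
    H_k = -np.sum(p_k * np.log(p_k + eps))             # cluster entropy H(K)
    H_c_given_k = -np.sum(p_ck * np.log(p_ck / (p_k + eps) + eps))
    H_k_given_c = -np.sum(p_ck * np.log(p_ck / (p_c[:, None] + eps) + eps))
    h = 1.0 if H_c == 0 else 1 - H_c_given_k / H_c     # homogeneity
    c = 1.0 if H_k == 0 else 1 - H_k_given_c / H_k     # completeness
    return 2 * h * c / (h + c) if h + c > 0 else 0.0
</code>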

Comparing Metrics

  • Multidimensional scaling
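
One way the MDS comparison could work, sketched under the assumption that every metric is scored on a common set of clusterings: turn 1 - |correlation| into a dissimilarity and embed with classical (Torgerson) MDS:

<code python>
import numpy as np

def mds_embed(scores, dims=2):
    """Classical (Torgerson) MDS embedding of metrics.

    scores: (n_metrics, n_clusterings) array, one row per metric, every
    metric scored on the same clusterings. Dissimilarity between metrics
    is 1 - |Pearson correlation|, so metrics that rank clusterings alike
    end up close together in the embedding."""
    D = 1 - np.abs(np.corrcoef(scores))        # (m, m) dissimilarities
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:dims]        # largest eigenvalues first
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))
</code>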

Possible Data Sets

  • Look at new del.icio.us dataset crawled by Michael
  • Check PennTags and LibraryThing.com for more labeled data.
  • Also for more labeled data: Library of Congress Subject Labels
  • 20 Newsgroups at \\entropy\data
  • Think about how to use Wikipedia data for clustering and classification

Coding Tasks

  • XX Debugging EM:
    • XX Figure out whether we have bleed-over from iteration to iteration. Would it help to zero out some of the data structures between iterations?
    • XX Test the assumption that a seeded Random object will produce the same stream.
    • XX Verify your log distributions everywhere.
    • XX Verify that documents are getting loaded in the same order.
    • Initialize model parameters and NOT posteriors (?)
  • More efficient datum representation as vectors of counts.
    • Write Vector class (done 9/22)
    • Replace Labeled Datums with vectors in clusterer tester, clusterers
  • Rewrite code for an explicit E-step/M-step division (95% done; blocked on a bug)
  • Try something other than Laplace smoothing, such as Good-Turing, to smooth multinomial NB.
  • Implement K-means
    • Easy K-means using TF-IDF vectors (sketched below).
    • K-means using “hard” version of EM and multinomial count vectors.
    • Bisecting K-means
    • Read about constrained K-means (Wagstaff et al., ICML 2001)
    • Read about fast K-means using the triangle inequality (Elkan, UCSD)
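
A minimal sketch of the “easy” TF-IDF K-means, in its spherical variant since cosine is the natural similarity for TF-IDF rows; the hard-EM and bisecting variants could reuse the same skeleton:

<code python>
import numpy as np

def tfidf_kmeans(X, k, n_iter=100, seed=0):
    """Spherical k-means on row-normalized TF-IDF vectors.

    X: (D, V) dense TF-IDF matrix. Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    z = np.full(len(X), -1)
    for _ in range(n_iter):
        # cosine similarity reduces to a dot product on unit vectors
        z_new = (X @ centroids.T).argmax(axis=1)
        if np.array_equal(z_new, z):
            break                                      # converged
        z = z_new
        for j in range(k):
            members = X[z == j]
            if len(members):                           # skip empty clusters
                c = members.mean(axis=0)
                centroids[j] = c / (np.linalg.norm(c) + 1e-12)
    return z, centroids
</code>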


  • Assimilate all of Michael Goulding's changes in an appropriate way
  • From Michael:
    • Adjusted Rand index = -28752.55 / 1508641.19 = -0.019059
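
For checking numbers like Michael's, a sketch of the adjusted Rand index computed from the contingency table (the Hubert & Arabie formulation):

<code python>
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table. 1 means identical
    partitions, 0 is chance level, and slightly negative values (like the
    -0.019 above) mean agreement worse than chance."""
    _, c_idx = np.unique(labels_true, return_inverse=True)
    _, k_idx = np.unique(labels_pred, return_inverse=True)
    A = np.zeros((c_idx.max() + 1, k_idx.max() + 1))
    for c, k in zip(c_idx, k_idx):
        A[c, k] += 1
    comb2 = lambda x: x * (x - 1) / 2.0            # n-choose-2, elementwise
    index = comb2(A).sum()
    sum_rows = comb2(A.sum(axis=1)).sum()
    sum_cols = comb2(A.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(A.sum())
    max_index = (sum_rows + sum_cols) / 2.0
    return (index - expected) / (max_index - expected)
</code>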

Resources to Investigate

  • Check out NIPS Workshop on “Learning to Compare Examples”
  • Read the Meila and Heckerman paper (linked from the CS 598R reading list).
  • Estimate log with a Taylor-series expansion for quick evaluation – ask Hal (see the sketch after this list)
    • Augment wiki page on “Log domain calculations” with this info.
  • Read Russell and Norvig sections on EM and MCMC for approximate Bayesian inference. Compare with Gauss-Seidel.
  • Contact Teg to get access to his EM + Gibbs Sampler code as another clusterer to try on our data.
  • Read Fast Clustering Papers and supplement bibliography
  • Read the Bekkerman & McCallum paper from ICML 2005 on co-clustering
  • Get the Griffiths & Steyvers LDA code – Gibbs-sampling based
  • Read Radford Neal Technical Report “Markov chain sampling methods for Dirichlet process mixture models” from 1998
  • Investigate Multi-view Clustering (like Multi-view Classification) – ref.: Sanjoy Dasgupta and Inderjit Dhillon.
  • “Product of Experts” – ref.: Geoff Hinton
  • Look into prior work on using graph cuts and flows to cluster documents
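
A sketch of the log-domain addition referred to in the Taylor-series item above; the -745 underflow cutoff is an assumption about IEEE doubles:

<code python>
import math

def log_add(log_a, log_b):
    """log(a + b) computed from log a and log b without leaving the log
    domain: log(a + b) = log a + log(1 + exp(log b - log a)) for a >= b.
    math.log1p is accurate for small x; it is where the Taylor series
    log(1 + x) = x - x^2/2 + x^3/3 - ... comes in."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    diff = log_b - log_a
    if diff < -745.0:            # exp() underflows an IEEE double to 0 here
        return log_a
    return log_a + math.log1p(math.exp(diff))

def log_normalize(log_xs):
    """Normalize a log-domain distribution, e.g. a sampled posterior."""
    total = log_xs[0]
    for lx in log_xs[1:]:
        total = log_add(total, lx)
    return [lx - total for lx in log_xs]
</code>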

Ideas to Explore

  • Propose an alternate version of EM, inspired by Gauss-Seidel: every time you change a P(w|c), update the model parameters (sketched below). For 611: prove convergence. Propose how to parallelize.
  • Cheeseman and Stutz's approximation to the marginal log likelihood of the data
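
A rough sketch of the Gauss-Seidel-flavored EM proposed above, for a mixture of multinomials; this is close to Neal & Hinton's incremental EM, which may be a useful starting point for the convergence proof:

<code python>
import numpy as np

def incremental_em(counts, K, n_sweeps=50, alpha=1.0, seed=0):
    """Gauss-Seidel-style EM for a mixture of multinomials: after every
    document's E-step its responsibilities are folded straight back into
    the sufficient statistics, so later documents in the same sweep see
    the updated parameters immediately.

    counts: (D, V) word-count matrix. Returns responsibilities (D, K)."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    R = rng.dirichlet(np.ones(K), size=D)       # soft assignments
    N_k = R.sum(axis=0)                         # expected docs per cluster
    N_kw = R.T @ counts                         # expected word counts
    for _ in range(n_sweeps):
        for d in range(D):
            # remove doc d's old contribution to the statistics
            N_k -= R[d]
            N_kw -= np.outer(R[d], counts[d])
            # E-step for d under the current (smoothed) parameters
            log_pi = np.log(N_k + alpha)
            log_theta = (np.log(N_kw + alpha)
                         - np.log((N_kw + alpha).sum(axis=1, keepdims=True)))
            log_r = log_pi + log_theta @ counts[d]
            r = np.exp(log_r - log_r.max())
            R[d] = r / r.sum()
            # the Gauss-Seidel move: fold the update back in immediately
            N_k += R[d]
            N_kw += np.outer(R[d], counts[d])
    return R
</code>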

Newsreader

  • select the best clustering algorithm for keyphrase extraction from news