===Week 1: Text Classification with Naive Bayes===
* "A Comparison of Event Models for Naive Bayes Text Classification", by Andrew McCallum and Kamal Nigam. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48. Technical Report WS-98-05. AAAI Press. 1998. [http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf PDF].
* (optional) "Naive Bayes Text Classification: A Statistical Natural Language Processing Project", by Chris Monson [[media:nlp:Chris_Monson.pdf]].

===Week 2: Semi-Supervised Learning with Naive Bayes and Expectation Maximization===
* "Learning to Classify Text from Labeled and Unlabeled Documents", by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. [http://www.kamalnigam.com/papers/emcat-aaai98.pdf PDF] (8 pages)
* (optional) "Text Classification from Labeled and Unlabeled Documents using EM", by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, 39(2/3). pp. 103-134. 2000. [http://www.kamalnigam.com/papers/emcat-mlj99.pdf PDF] (34 pages)

===Week 3: Text Classification with Maximum Entropy===
* "Using Maximum Entropy for Text Classification", by Kamal Nigam, John Lafferty, Andrew McCallum. [http://www.cs.cmu.edu/~knigam/papers/maxent-ijcaiws99.pdf PDF] (7 pages)
* (optional) "A Maximum Entropy Approach to Natural Language Processing", by Adam Berger, Vincent Della Pietra, Stephen Della Pietra. [http://acl.ldc.upenn.edu/J/J96/J96-1002.pdf PDF] (34 pages)

===Week 4: Feature Selection===
* Mutual information and log-likelihood ratio sections in Manning & Schuetze: 5.1-5.4
* (optional) "A comparative study on feature selection for text categorization", by Yiming Yang and Jan Pedersen. [http://citeseer.ifi.unizh.ch/cache/papers/cs/1982/http:zSzzSzwww.cs.cmu.eduzSz~yimingzSzpapers.yyzSzml97.pdf/yang97comparative.pdf PDF]

===Week 5: Feature Selection in the Learning Loop===
* Focus on Section 4, which covers feature selection in the learning loop: "A Maximum Entropy Approach to Natural Language Processing", by Adam Berger, Vincent Della Pietra, Stephen Della Pietra. [http://acl.ldc.upenn.edu/J/J96/J96-1002.pdf PDF]

===Week 6: Feature Selection as Word Clustering===
* "Distributional Clustering of Words for Text Classification", by Douglas Baker and Andrew McCallum. [http://citeseer.ist.psu.edu/cache/papers/cs/6562/http:zSzzSzwww.cs.cmu.eduzSz~mccallumzSzpaperszSzclustering-sigir98s.pdf/baker98distributional.pdf PDF]
* (optional) An interesting read on a similar feature selection mechanism. [http://www.phil.uni-passau.de/linguistik/mitarbeiter/schneider/pub/acl2004.html] [http://www.phil.uni-passau.de/linguistik/mitarbeiter/schneider/pub/acl2004.pdf]

===Week 7: Text Classification with Support Vector Machines===
* Work through as much of the SVM Tutorial by Nello Cristianini as you can. I don't expect you to get all the way through this. Presentation slides from the ICML 2001 Tutorial: [http://www.support-vector.net/icml-tutorial.pdf PDF]
* "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", by Thorsten Joachims. [http://www.cs.cornell.edu/People/tj/publications/joachims_98a.pdf PDF]

----
Moving on to text clustering ...

===Weeks 8 & 9: Clustering with Naive Bayes===
* "An Experimental Comparison of Several Clustering and Initialization Methods", by Marina Meila and David Heckerman. Try to fight through the whole thing. [http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=165 PS]

===Week 10: Bayesian Smoothing===
* "Bayesian smoothing through text classification", by Tom Griffiths. [http://nlp.stanford.edu/courses/cs224n/2001/gruffydd/smoothing.html]

===Week 11: Going Beyond Naive Bayes===
* "Latent Dirichlet Allocation", by D. Blei, A. Ng, and M. Jordan. This is dense. Read as much of this as you can. [http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf PDF]
* Blei's code is also available here: [http://www.cs.princeton.edu/~blei/lda-c/]

----
Extra reading:

===Clustering Email===
* "Inferring Ongoing Activities of Workstation Users by Clustering Email". [http://www.cs.cmu.edu/~hyifen/publication/EmailCluster04.pdf PDF] Shorter version: [http://www.cs.cmu.edu/~hyifen/publication/CEAS2004.pdf PDF]
* "Automatic Discovery of Personal Topics To Organize Email", by Arun C. Surendran, John C. Platt and Erin Renshaw. Conference on Email and Anti-Spam, 21-22 July 2005, Stanford University. [http://research.microsoft.com/~acsuren/PersonalTopics.pdf PDF]
* "Restrictive Clustering and Metaclustering for Self-Organizing Document Collections". [http://doi.acm.org/10.1145/1008992.1009032]