==Course Questions==
*Adam Drake's game review data
**Consider running a histogram, etc. to find positive-negative, with or without neutral.
**Make usable for class
*Find a good summarization data set
**Used by state-of-the-art approaches, or at least Bayesian ones
**Hal's paper on Query-based Summarization is a good start
*Enable calling libSVM code from the code base without the need for text files.
*Clean up/update rubrics for projects.
*TODO: Set up the code distribution system better:
**Suggestion: move support code to a jar and keep the stub classes separate from the jar. The jar can then be distributed through svn or a tarball, and changes should only involve the jar. The stub classes can also be released with the labs instead of all at once.

==George's TODO==
*Handle [[CS-679-Action#Old Material|old material]] section
*Solve the course questions I can
*Analyze/clean up cluster browser as a way to start thinking about visualization
*Check out all instruction pages (i.e., supercomputer how-to, etc.)

==Old Material==
(to be moved up or deleted soon)

Organization of codebase: separating edu.berkeley.nlp and edu.byu.nlp

Fix handling of held-out data

Projects:
* Text classification: get to know the tokenization pipeline and build a simple text classifier (like k-NN)
* Text classification: Naive Bayes
* Text classification: MaxEnt or SVM
* Clustering: k-Means
* Clustering: EM
* Clustering: LDA
* Final Project: help fill our clustering survey matrix
** Pick unique entries from the list that add up to a certain number of "difficulty points"

Final Presentation:
* Implement a clustering algorithm from the literature in our framework
* Evaluate on given datasets
* Do error analysis
* Propose some improvement
* Run experiments using the improved algorithm
* Evaluate

Datasets:
* 20 newsgroups
* New (90s) Reuters
* Possibly Old (80s) Reuters
* Del.icio.us dataset
* Wikipedia categories dataset
* Enron