==In Progress==

==To Do Now==
* Extract a DeliciousExt data set with the additional constraint that the chosen del.icio.us tags must also be extractive keyphrases for the documents collected.
* List all features currently used by machine learning. [[Feature List]]
** Include log(1+x) as a feature, in addition to x?
* Label 100 documents and re-train the model (e.g., with Slashdot-type comments)
* Profile the frequency of occurrence of features using Weka
* Continue feature engineering to improve classification performance using n-fold cross-validation.
* Work on extraction of keyphrases from multiple documents (e.g., a cluster of documents)
* Create a Java interface to support clustering of news items that have been read
** Save each read news item and document as a "FENA" (Features of Entry and Article) XML file.
** Include metadata such as time spent, etc.

==To Do Next==
* Revisit the [[pre-requisites for going public and gathering data from others]].
* Add a button to the RSSOwl interface so that users can give feedback on their interest level in a given news story: like / dislike / neutral. We would like to think about how to incorporate this sort of preference information into the clustering algorithm.
* Add a suggestion dialog box
* Add a results page to the design document.
* Investigate automatic feature binarization for Weka to enable use of other classifiers (e.g., maxent)

==To Do At Some Later Date==
* Add the category field of a feed entry to the learning algorithm
* Add a menu option to ban a Wikipedia category
* Add position features for machine learning.
* Check for stop words before adding a user keyphrase (or is this handled by couldBeKeyphrase()?)
* Sentence breaking
* Try indexing into Wikipedia by single terms and combinations, grab all the search results, and combine them into something. Prefer all terms first.
* Use the LingPipe named-entity detector as a feature
* Integrate fully into the GUI, recognizing user settings, languages, etc.
* Sync with the newest RSSOwl codebase
* Automatically identify when an article page is just an ad; identify the link to the true article.

==Optional Features==
* Ignore comments
* Stop words have to be excluded? I run out of heap space otherwise.
* Go through the web page; look at the value part of the attribute for either "topic" or "tag"
* Smart folder that is a "meta-feed"

==Probably Never Going to Do==
* Context-menu option to add keyphrases from the RSS blurb
* Look up link text from other pages that link to the webpage, via a search engine.

==Done==
* Revisit Dan's comments on the del.icio.us data and recrawl
* Move the action list here from the Word doc.
* Add the feed URL to the FENA
* Add document frequency to the ARFF files
* Create a Learner that saves its model to disk.
* Have the newsreader load the classifier from disk and use it instead of the baseline model (maybe via a switch?).
* Somehow use user ratings to influence the learning model. Maybe just have users submit ratings for performance analysis, and use the FENA data for more training?
* Experiment with using Wikipedia for better keywords.
* Get query search logs from Microsoft Research and ask them about phrase position features in the MoC model.
* Good cut-off? I'm thinking 30%
* Validate user-entered keyphrases by checking them for existence in the FENA. What about Wikipedia non-extractive keywords?
* Load the stop words only once
* Cache the keyword data for the last 20 or so news items
* Grab the Wikipedia article title.
* Rate and save Wikipedia keyphrases
* Thread the GUI somehow? How can we halt it when the user clicks rapidly?
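The DeliciousExt constraint in "To Do Now" could be checked with something like the following minimal sketch. It assumes "extractive" means the tag text occurs verbatim in the document; the class and method names are illustrative, not part of the existing codebase, and a real check would also normalize whitespace and tokenization.

```java
public class ExtractiveTagFilter {
    // Keep a del.icio.us tag only if its text appears verbatim in the
    // document, ignoring case (simplifying assumption about "extractive").
    public static boolean isExtractive(String tag, String docText) {
        return docText.toLowerCase().contains(tag.toLowerCase());
    }
}
```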
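The log(1+x) idea under [[Feature List]] amounts to a simple transform of the feature vector. A sketch, assuming non-negative numeric features; the doubled-array layout is just one way to lay it out, not the project's actual representation:

```java
public class LogFeatures {
    // Return the original features followed by log(1+x) for each feature x,
    // so both raw and log-scaled versions are available to the learner.
    public static double[] withLogFeatures(double[] x) {
        double[] out = new double[x.length * 2];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i];
            out[x.length + i] = Math.log1p(x[i]);
        }
        return out;
    }
}
```

Math.log1p(x) is preferred over Math.log(1 + x) because it stays accurate for x near zero.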
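The "cache the last 20 or so" item in "Done" could have been implemented with a standard Java LRU idiom: a LinkedHashMap in access order whose eldest entry is evicted past a size cap. This is a sketch under that assumption; the String value type stands in for whatever the per-item keyword data actually is.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache: keeps keyword data for the 20 most recently used news items.
public class KeywordCache extends LinkedHashMap<String, String> {
    private static final int MAX_ENTRIES = 20;

    public KeywordCache() {
        super(32, 0.75f, true); // true = access-order, giving LRU eviction
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > MAX_ENTRIES; // evict once the cap is exceeded
    }
}
```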