==In Progress==

==To Do Now==
* Extract a DeliciousExt data set with the additional constraint that the chosen del.icio.us tags must also be extractive keyphrases for the documents collected.
* List all features currently used by machine learning. [[Feature List]]
** Include log(1+x) as a feature, in addition to x?
* Label 100 documents and re-train the model (e.g., with Slashdot-type comments)
* Profile the frequency of occurrence of features using Weka
* Continue feature engineering to improve classification performance using n-fold cross-validation.
* Work on extraction of keyphrases from multiple documents (e.g., a cluster of documents)
* Create a Java interface to support clustering of news items that have been read
** Save each read news item and document as a "FENA" (Features of Entry and Article) XML file.
** Include metadata such as time spent, etc.

==To Do Next==
* Revisit the [[pre-requisites for going public and gathering data from others]].
* Add a button to the RSSOwl interface so that users can give feedback on their interest level in a given news story: like / dislike / neutral. We would like to think about how to incorporate this sort of preference information into the clustering algorithm.
* Add a suggestion dialog box
* Add a results page to the design document.
* Investigate automatic feature binarization for Weka to enable use of other classifiers (e.g., maxent)

==To Do At Some Later Date==
* Add the category field of a feed entry to the learning algorithm
* Add a menu option to ban a Wikipedia category
* Add position features for machine learning.
* Check for stop words before adding a user keyphrase (or is this handled by couldBeKeyphrase()?)
* Sentence breaking
* Try indexing into Wikipedia by single terms and combinations, grab all the search results, and combine them into something. Prefer all terms first.
* Use the LingPipe named-entity detector as a feature
* Integrate fully into the GUI, recognizing user settings, languages, etc.
* Sync with the newest RSSOwl codebase
* Automatically identify when an article page is just an ad; identify the link to the true article.

==Optional Features==
* Ignore comments
* Stop words have to be excluded? I run out of heap space otherwise.
* Go through the web page; look at the value part of the attribute for either "topic" or "tag"
* Smart folder that is a "meta-feed"

==Probably Never Going to Do==
* Context-menu option to add keyphrases from the RSS blurb
* Look up link text from other pages that link to the webpage, via a search engine.

==Done==
* Revisit Dan's comments on the del.icio.us data and recrawl
* Move the action list here from the Word doc.
* Add the feed URL to the FENA
* Add document frequency to the ARFF files
* Create a Learner that saves its model to disk.
* Have the newsreader load the classifier from disk and use it instead of the baseline model (maybe via a switch?).
* Somehow use user ratings to influence the learning model. Maybe just have users submit ratings for performance analysis, and use the FENA data for more training?
* Experiment with using Wikipedia for better keywords.
* Get query search logs from Microsoft Research and ask them about phrase position features in the MoC model.
* Good cut-off? I'm thinking 30%
* Validate user-entered keyphrases by checking them for existence in the FENA. What about Wikipedia non-extractive keywords?
* Load the stop words only once
* Cache the keyword data for the last 20 or so news items
* Grab the Wikipedia article title.
* Rate and save Wikipedia keyphrases
* Thread the GUI somehow? How can we halt it when the user clicks rapidly?
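The DeliciousExt constraint in "To Do Now" could be checked with something like the following minimal sketch. It assumes "extractive" means the tag text occurs verbatim in the document; the class and method names are illustrative, not part of the existing codebase, and a real check would also normalize whitespace and tokenization.

```java
public class ExtractiveTagFilter {
    // Keep a del.icio.us tag only if its text appears verbatim in the
    // document, ignoring case (simplifying assumption about "extractive").
    public static boolean isExtractive(String tag, String docText) {
        return docText.toLowerCase().contains(tag.toLowerCase());
    }
}
```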
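The log(1+x) idea under [[Feature List]] amounts to a simple transform of the feature vector. A sketch, assuming non-negative numeric features; the doubled-array layout is just one way to lay it out, not the project's actual representation:

```java
public class LogFeatures {
    // Return the original features followed by log(1+x) for each feature x,
    // so both raw and log-scaled versions are available to the learner.
    public static double[] withLogFeatures(double[] x) {
        double[] out = new double[x.length * 2];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i];
            out[x.length + i] = Math.log1p(x[i]);
        }
        return out;
    }
}
```

Math.log1p(x) is preferred over Math.log(1 + x) because it stays accurate for x near zero.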
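The "cache the last 20 or so" item in "Done" could have been implemented with a standard Java LRU idiom: a LinkedHashMap in access order whose eldest entry is evicted past a size cap. This is a sketch under that assumption; the String value type stands in for whatever the per-item keyword data actually is.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache: keeps keyword data for the 20 most recently used news items.
public class KeywordCache extends LinkedHashMap<String, String> {
    private static final int MAX_ENTRIES = 20;

    public KeywordCache() {
        super(32, 0.75f, true); // true = access-order, giving LRU eviction
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > MAX_ENTRIES; // evict once the cap is exceeded
    }
}
```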