nlp-private:intelligent-newsreader [2015/04/22 21:06] (current)
ryancha created
 +==In Progress==
 +
 +
 +==To Do Now==
 +
 +* Extract a DeliciousExt data-set with the additional constraint that the chosen del.icio.us tags must also be extractive key-phrases for the documents collected.
 +
 +* List all features currently used by the machine learning model. [[Feature List]]
 +** Include log(1+x) as a feature, in addition to x?
 +
 +* Label 100 documents and re-train model (e.g., with Slashdot-type comments)
 +
 +* Profile the frequency of occurrence of features using Weka.
 +
 +* Continue feature engineering to improve classification performance using n-fold cross-validation.
 +
 +* Work on extraction of keyphrases from multiple documents (e.g., cluster of documents)
 +
 +* Create a Java interface to support clustering of news items that have been read
 +** Save each read news item and document as a "FENA" (Features of Entry and Article) XML file.
 +** Include meta-data such as time spent, etc.
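
The log(1+x) question in the feature list item above could be explored with something like the following. This is only a minimal sketch: the class name `LogFeatures`, the `log1p_` prefix, and the map-based feature representation are hypothetical and are not taken from the project's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: adds a log(1+x) companion for every non-negative feature.
final class LogFeatures {
    private LogFeatures() {}

    // Returns a copy of the input with an extra "log1p_<name>" entry per feature.
    // Negative values are skipped, since counts and frequencies should be >= 0.
    static Map<String, Double> withLogFeatures(Map<String, Double> features) {
        Map<String, Double> out = new LinkedHashMap<>(features);
        for (Map.Entry<String, Double> e : features.entrySet()) {
            double x = e.getValue();
            if (x >= 0.0) {
                // Math.log1p computes ln(1 + x) accurately for small x.
                out.put("log1p_" + e.getKey(), Math.log1p(x));
            }
        }
        return out;
    }
}
```

Keeping x alongside log(1+x) lets the learner decide which scale is more useful; whether it helps would still need to be checked with cross-validation.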
 +
 +==To Do Next==
 +
 +* Revisit the [[pre-requisites for going public and gathering data from others]].
 +
 +* Add a button to the RSSOwl interface so that users can rate their interest in a given news story: like / dislike / neutral. We would also like to think about how to incorporate this preference information into the clustering algorithm.
 +
 +* Add a suggestion dialog box
 +
 +* Add results page to design document.
 +
 +* Investigate automatic feature binarization for Weka to enable use of other classifiers (e.g., maxent)
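
As a starting point for the binarization item above, a numeric feature can be turned into one-hot indicator features over intervals. The sketch below is plain Java and does not use Weka's API; the class name `FeatureBinarizer` is hypothetical, and the cut points are assumed to be supplied by the caller in ascending order (choosing them automatically is the part that still needs investigating).

```java
// Hypothetical helper: maps a numeric value to a one-hot vector of bin indicators.
final class FeatureBinarizer {
    private FeatureBinarizer() {}

    // cutPoints must be sorted ascending. With k cut points there are k + 1 bins:
    // (-inf, c0), [c0, c1), ..., [c_{k-1}, +inf).
    static int[] oneHotBin(double value, double[] cutPoints) {
        int bin = 0;
        while (bin < cutPoints.length && value >= cutPoints[bin]) {
            bin++;
        }
        int[] indicators = new int[cutPoints.length + 1];
        indicators[bin] = 1; // exactly one indicator is set
        return indicators;
    }
}
```

For example, with cut points {1.0, 5.0} a value of 0.5 falls in the first bin, 1.0 in the second, and 7.0 in the third. Each indicator can then be fed to a classifier that expects binary features.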
 +
 +==To Do At Some Later Date==
 +
 +* Add the category field of a feed entry to the learning algorithm
 +
 +* Add a menu option to ban a Wikipedia category
 +
 +* Add position features for machine learning.
 +
 +* Check for stop words before adding a User keyphrase (or is this handled by couldBeKeyphrase()?)
 +
 +* Sentence breaking
 +
 +* Try indexing into Wikipedia by single terms, and combinations, and grab all the search results and combine them into something. Prefer all terms first.
 +
 +* Use Lingpipe named-entity detector as a feature
 +
 +* Integrate fully into the GUI, recognizing user settings, languages, etc.
 +
 +* Sync with the newest RSSOwl codebase
 +
 +* Automatically identify when an article page is just an ad; identify link to true article.
 +
 +==Optional Features==
 +
 +* Ignore comments
 +
 +* Do stop words have to be excluded? I run out of heap space otherwise.
 +
 +* Go through web page; look at the value part of attribute for either “topic” or “tag”
 +
 +* Smart folder that is a “meta-feed”
 +
 +==Probably Never Going to Do==
 +
 +* Context-menu option to add keyphrases from RSS blurb
 +
 +* Look up link text from other pages that link to the webpage, via a search engine.
 +
 +==Done==
 +
 +* Revisit Dan's comments on the del.icio.us data and recrawl
 +
 +* Move your action list here from Word doc.
 +
 +* Add the feed URL to the FENA
 +
 +* Add document frequency to ARFF files
 +
 +* Create Learner that saves its model to disk.
 +
 +* Have the Newsreader load the classifier from disk, and use it instead of the baseline model (maybe via a switch?).
 +
 +* Somehow use user ratings to influence the learning model. Maybe just have them submit user ratings for performance analysis, and use the FENA data for more training?
 +
 +* Experiment using Wikipedia for better keywords.
 +
 +* Get query search logs from Microsoft Research and ask them about phrase position features in the MoC model.
 +
 +* Good cut-off? I’m thinking 30%
 +
 +* Validate user-entered keyphrases by checking them for existence in the FENA. What about Wikipedia non-extractive keywords?
 +
 +* Load the stop words only once
 +
 +* Cache the keyword data for the last 20 or so news items
 +
 +* Grab the Wikipedia article title.
 +
 +* Rate and save Wikipedia keyphrases
 +
 +* Thread the GUI somehow? How can we halt it when the user clicks rapidly?
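
The "cache the last 20 or so" item above describes a small bounded cache. One common way to get that in Java is a `LinkedHashMap` in access order with `removeEldestEntry` overridden. The sketch below is an illustration of that technique, not the project's actual implementation; the class name `RecentKeywordCache` and the key and value types are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache: keeps only the most recently used entries.
final class RecentKeywordCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int capacity;

    RecentKeywordCache(int capacity) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

A cache created with `new RecentKeywordCache<String, String>(20)` would hold keyword data for at most 20 news items, evicting the least recently accessed one first. `LinkedHashMap` is not thread-safe, so if the GUI is threaded the cache would need external synchronization.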
  
nlp-private/intelligent-newsreader.txt · Last modified: 2015/04/22 21:06 by ryancha