nlp-private:intelligent-newsreader [CS Wiki]

In Progress

To Do Now

Extract a DeliciousExt data-set with the additional constraint that the chosen del.icio.us tags must also be extractive key-phrases for the documents collected.

List all features currently used by the machine learning model (see Feature List).
- Open question: include log(1+x) as a feature, in addition to x?
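
As a minimal sketch of the log(1+x) question above, assuming x is a non-negative count-like feature (for example, a term frequency), both values could be emitted side by side. The class and method names here are hypothetical and not part of the existing codebase.

```java
final class LogFeature {
    /**
     * Returns {x, log(1 + x)} for a non-negative raw value.
     * Math.log1p is used because it stays accurate when x is close to zero.
     */
    static double[] withLogTransform(double x) {
        if (x < 0) {
            throw new IllegalArgumentException("expected a non-negative value");
        }
        return new double[] { x, Math.log1p(x) };
    }
}
```

The log-scaled copy compresses large counts, so one very frequent phrase does not dominate the feature range.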

Label 100 documents and re-train model (e.g., with Slashdot-type comments)

Profile the frequency of occurrence of features using Weka

Continue feature engineering to improve classification performance using n-fold cross-validation.
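
For reference, the fold assignment behind n-fold cross-validation can be sketched in plain Java as below. This is only an illustration of the split; Weka provides its own cross-validation support, and the class name `FoldSplitter` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class FoldSplitter {
    /** Shuffles the indices 0..n-1 and deals them round-robin into k folds. */
    static List<List<Integer>> split(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // A fixed seed keeps the split reproducible between experiments.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }
}
```

Each fold is held out once for testing while the remaining folds are used for training; averaging the k scores gives the figure to compare across feature sets.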

Work on extraction of keyphrases from multiple documents (e.g., cluster of documents)

Create a Java interface to support clustering of news items that have been read
- Save each read news item and document as a “FENA” (Features of Entry and Article) XML file.
- include meta-data such as time spent, etc.

To Do Next

Revisit the pre-requisites for going public and gathering data from others.

Add a button to the RSSOwl interface so that users can give feedback on their interest level in a given news story: like / dislike / neutral. We would like to think about how to incorporate this sort of preference information into the clustering algorithm.

Add a suggestion dialog box

Add results page to design document.

Investigate automatic feature binarization for Weka to enable use of other classifiers (e.g., maxent)
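
One possible form of binarization, sketched under the assumption that a numeric feature is split into equal-width bins and one-hot encoded, is shown below. The binning scheme and the name `Binarizer` are assumptions made for illustration; the approach actually chosen for Weka may differ.

```java
final class Binarizer {
    /**
     * One-hot encodes x into `bins` equal-width intervals over [min, max].
     * Values outside the range are clamped into the first or last bin.
     */
    static int[] oneHot(double x, double min, double max, int bins) {
        if (bins < 1 || max <= min) {
            throw new IllegalArgumentException("need bins >= 1 and max > min");
        }
        int idx = (int) ((x - min) / (max - min) * bins);
        idx = Math.max(0, Math.min(bins - 1, idx));
        int[] out = new int[bins];
        out[idx] = 1;
        return out;
    }
}
```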

To Do At Some Later Date

Add the category field of a feed entry to the learning algorithm

Add menu option to ban a Wikipedia category

Add position features for machine learning.

Check for stop words before adding a User keyphrase (or is this handled by couldBeKeyphrase()?)

Sentence breaking

Try indexing into Wikipedia by single terms, and combinations, and grab all the search results and combine them into something. Prefer all terms first.

Use Lingpipe named-entity detector as a feature

Integrate fully into the GUI, recognizing user settings, languages, etc.

Sync with the newest RSSOwl codebase

Automatically identify when an article page is just an ad; identify link to true article.

Optional Features

Ignore comments

Do stop words have to be excluded? I run out of heap space otherwise.

Go through web page; look at the value part of attribute for either “topic” or “tag”

Smart folder that is a “meta-feed”

Probably Never Going To Do

Context-menu option to add keyphrases from RSS blurb

Look up link text from other pages that link to the webpage, via a search engine.

Done

Revisit Dan's comments on the del.icio.us data and recrawl

Move your action list here from Word doc.

Add the feed URL to the FENA

Add document frequency to ARFF files

Create Learner that saves its model to disk.

Have the Newsreader load the classifier from disk, and use it instead of the baseline model (maybe via switch?).

Somehow use user ratings to influence the learning model. Maybe just have them submit user ratings for performance analysis, and use the FENA data for more training?

Experiment using Wikipedia for better keywords.

Get query search logs from Microsoft Research and ask them about phrase position features in the MoC model.

Good cut-off? I’m thinking 30%

Validate user-entered keyphrases by checking them for existence in the FENA. What about Wikipedia non-extractive keywords?

Load the stop words only once
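
One common way to load a resource only once in Java is the initialization-on-demand holder idiom, sketched below. The word list is a placeholder and the `loadCount` counter exists only to make the behaviour observable; the real loader would presumably read the project's stop-word file, whose location is not given here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StopWords {
    /** Counts how often the list is loaded (illustration only). */
    static int loadCount = 0;

    /** The JVM initializes this nested class once, on first use. */
    private static final class Holder {
        static final Set<String> WORDS = load();
    }

    private static Set<String> load() {
        loadCount++;
        // Placeholder list; a real loader would read the stop-word file here.
        return Collections.unmodifiableSet(
                new HashSet<>(Arrays.asList("a", "an", "the", "of", "and")));
    }

    static boolean isStopWord(String token) {
        return Holder.WORDS.contains(token.toLowerCase(Locale.ROOT));
    }
}
```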

Cache the keyword data for the last 20 or so news items
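
A small bounded cache of this kind can be sketched with `LinkedHashMap` in access order, which evicts the least recently used entry once the capacity is exceeded. The class name and the choice of key and value types are hypothetical; a capacity of 20 would match the note above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RecentKeywordCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    RecentKeywordCache(int capacity) {
        // accessOrder = true moves an entry to the end whenever it is read.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; drop the least recently used entry.
        return size() > capacity;
    }
}
```

Note that `LinkedHashMap` is not thread-safe, so access from several threads would need external synchronization.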

Grab the Wikipedia article title.

Rate and save Wikipedia keyphrases

Thread the GUI somehow? How can we halt it when the user clicks rapidly?
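
One way to handle rapid clicks is to debounce the work: each click cancels the pending task and schedules a new one, so only the last click in a burst runs. The sketch below illustrates that idea with `java.util.concurrent`; it is not necessarily how the item was resolved, and the `Debouncer` class is hypothetical.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class Debouncer {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long delayMillis;
    private ScheduledFuture<?> pending;

    Debouncer(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Replaces any task that has not started yet with the new one. */
    synchronized void submit(Runnable task) {
        if (pending != null) {
            pending.cancel(false);
        }
        pending = scheduler.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Stops accepting work and waits for the last scheduled task to finish. */
    boolean shutdownAndWait(long timeoutMillis) {
        scheduler.shutdown();
        try {
            return scheduler.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The task runs on a background thread, so any UI update it makes would still need to be handed back to the UI thread.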

nlp-private/intelligent-newsreader.txt · Last modified: 2015/04/22 15:06 by ryancha
