Differences

This shows you the differences between two versions of the page.

cs-401r:assignment-6 [2014/11/29 17:48]
ringger [Data]
cs-401r:assignment-6 [2014/11/29 17:54] (current)
ringger [Feature Selection: Top-N per Document]
Line 119:

 Top-N per document is a simple method.
-* Choose a value $N$, the number of word types to be initially contributed by each document.
 * Create an empty feature pool $P$.
+* Choose a value $N$, the number of word types to be initially contributed to the pool by each document. Small values (2-5) of $N$ are often sufficient.
 * For each document $d$,
 ** Rank all word types in $d$ by a TF-IDF score, the term frequency inverse document frequency score
Line 127:
 ** Remove any word type from $d$ not in $P$.

-Small values (2-5) of $N$ are often sufficient.
-
-Note that nearly all documents will end up with at least $N$ features (word types), but most will end up with more, since the pool $P$ will include word types beyond the document’s top $N$. Some pathological documents may end up with fewer features or none at all. Empty documents can be removed prior to clustering.
+Note that nearly all documents will end up with at least $N$ features (word types), but most will end up with more, since the pool $P$ will include word types beyond the document’s top $N$. Some pathological documents may end up with fewer features or none at all. Empty documents could be removed prior to clustering.
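
The Top-N per document procedure can be sketched in Python as follows. This is a minimal illustration, not the assignment's reference solution. The function names (`tfidf_scores`, `top_n_per_document`) are hypothetical, and because this comparison view does not show the page's TF-IDF definition or the list steps between the two hunks, the sketch assumes the common weighting tf × log(D / df) and assumes that each document's top $N$ word types are added to $P$ before any document is filtered.

```python
import math
from collections import Counter


def tfidf_scores(tokens, doc_freq, num_docs):
    """Score each word type in one document.

    Assumed weighting (not taken from the page): tf * log(D / df).
    """
    counts = Counter(tokens)
    return {w: tf * math.log(num_docs / doc_freq[w]) for w, tf in counts.items()}


def top_n_per_document(docs, n):
    """Build the feature pool P, then filter every document against it.

    docs: list of token lists, one list per document.
    n:    number of word types each document contributes to the pool.
    """
    # Document frequency: in how many documents each word type occurs.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    pool = set()  # the feature pool P
    for tokens in docs:
        scores = tfidf_scores(tokens, doc_freq, len(docs))
        # Rank by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        pool.update(ranked[:n])

    # Remove any word type from each document that is not in the pool.
    filtered = [[w for w in tokens if w in pool] for tokens in docs]
    return pool, filtered
```

With `docs = [["a", "a", "b", "the"], ["c", "c", "the"], ["the"], []]` and `n = 1`, the first document keeps two word types even though it contributed only one to the pool, and the empty document stays empty, which is the kind of document that could be dropped before clustering.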
  
 * The TF-IDF score (“weight”) is defined as follows:
cs-401r/assignment-6.txt · Last modified: 2014/11/29 17:54 by ringger