This data set consists of a collection of 20,000 posts to 20 Usenet newsgroups.
By default, the data is tokenized as follows: all header information is discarded, and the remaining text is split into tokens corresponding to contiguous sequences of alphabetical characters. This conforms to the procedure followed by other researchers (TODO: fill in names here). By following the same procedure, we hope to reproduce their results as closely as possible.
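The tokenization described above can be sketched in a few lines of Python. This is an illustrative sketch, not the exact preprocessing code used for this data set; it assumes headers are separated from the message body by the first blank line, and the `tokenize` function name is our own.

```python
import re

def tokenize(text):
    # Discard header information: drop everything up to the first
    # blank line, which conventionally separates headers from the body.
    _, _, body = text.partition("\n\n")
    # Split the body into contiguous runs of alphabetical characters;
    # digits and punctuation act as token boundaries and are dropped.
    return re.findall(r"[A-Za-z]+", body)

message = "From: someone@example.com\nSubject: test\n\nHello world, 123 foo-bar!"
print(tokenize(message))  # ['Hello', 'world', 'foo', 'bar']
```

Note that under this scheme hyphenated and numeric tokens are broken apart or discarded entirely, which is one reason to follow the same convention as prior work when comparing results.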
The 20 Newsgroups data set can be downloaded here. The split in this version has the blind test data removed.