See the bottom of this page for the latest updates!

===Introduction===
This data set (Reuters-21578) is a collection of newswire articles from 1987. It is located on entropy in /home/data/Reuters/lewis. The data is stored in 20 SGML files. See http://xml.coverpages.org/sgml.html and http://en.wikipedia.org/wiki/SGML for details on SGML.

In short, SGML (Standard Generalized Markup Language) is the predecessor to XML. XML is a restricted subset of SGML, but not every SGML document is XML. Practically speaking, an XML parser may not be able to parse an SGML document. In the case of this data set, the files appear to be valid XML. There is a DTD describing the format of the files.

===Custom Split===
I've created a custom split of this data that differs from any other split. I simply took all of the articles and randomly selected a set from which to generate the test, dev, and blind subsets.
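
The random selection described above could be sketched roughly as follows. This is only an illustration, not the script actually used for the split: the function name `custom_split`, the 30% held-out fraction, the equal three-way division, the fixed seed, and the assumption that articles are identified by integer IDs 1 through 21578 are all hypothetical.

```python
import random


def custom_split(article_ids, held_out_fraction=0.3, seed=0):
    """Randomly hold out a fraction of the articles, then divide the
    held-out set into test, dev, and blind subsets.

    The fraction, the roughly equal three-way division, and the seed are
    illustrative assumptions, not the values used for the real split.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    ids = list(article_ids)
    rng.shuffle(ids)               # random order over all articles

    n_held = int(len(ids) * held_out_fraction)
    held_out, remaining = ids[:n_held], ids[n_held:]

    third = n_held // 3
    return {
        "test": held_out[:third],
        "dev": held_out[third:2 * third],
        "blind": held_out[2 * third:],   # takes any leftover articles
        "remaining": remaining,          # everything not held out
    }


# Assumed: one integer ID per article, 1 through 21578.
splits = custom_split(range(1, 21579))
for name, ids in splits.items():
    print(name, len(ids))
```

Seeding the generator keeps the split stable across runs, which matters if the subsets are regenerated rather than stored on disk.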