Enron Data Information

This is a descendant of the dataset harvested from Enron's mail server, which was used as evidence against the company and subsequently put into the public domain. Originally, there were no topic labels available for the Enron dataset, so it was appropriate only for unsupervised methods. Recently, the LDC has released a subset of 5,000 emails that were each hand-labeled with one of 32 categories. This labeled subset can now be used for classification, and external metrics can be used to evaluate clusterings of the annotated emails.

The data set is located on the NLP Lab server here. NOTE: You'll want the indicies_with_ldc_annotations.tgz as well.

Here is a paper describing the data set: http://ceas.cc/2004/168.pdf

Here is another paper: http://nyc.lti.cs.cmu.edu/yiming/Publications/klimt-ecml04.pdf

There are 619,446 email messages among 158 users in the original collection (from the above paper). The folks at CMU have cleaned up the data set, so the set that we have (since it came from CMU) has 200,399 messages among 158 users. Take a look at the paper for the finer details.

Organization. The folder /enron/maildir contains all the email data in maildir format. Please see http://en.wikipedia.org/wiki/Maildir for details. In summary, each email is stored in its own file inside a folder hierarchy. The folder /enron/indices/ldc_split/all/ contains 32 index files, 01.txt through 32.txt, one per category. 01.txt contains the relative paths of the emails (one per line) labeled as belonging to category 1 by the LDC folks, 02.txt contains the paths of the emails labeled as category 2, and so on.
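As a rough illustration of the layout described above, here is a minimal Python sketch that reads the index files and loads one labeled email. The function names read_index and load_email are made up for this example. It also assumes that the paths in the index files are relative to the maildir folder and that the index files are plain text; neither is confirmed here, so check against the actual files before relying on it.

```python
from email import message_from_bytes
from pathlib import Path


def read_index(index_dir, num_categories=32):
    """Return {category number: [relative email paths]} from 01.txt, 02.txt, ...

    Hypothetical helper: assumes one relative path per line, as described above.
    """
    paths_by_category = {}
    for category in range(1, num_categories + 1):
        index_file = Path(index_dir) / f"{category:02d}.txt"
        if not index_file.is_file():
            continue  # skip categories whose index file is missing
        with index_file.open(encoding="utf-8", errors="replace") as f:
            # Drop blank lines and surrounding whitespace.
            paths_by_category[category] = [line.strip() for line in f if line.strip()]
    return paths_by_category


def load_email(maildir_root, rel_path):
    """Parse one email file into an email.message.Message.

    Assumption: rel_path is relative to maildir_root.
    """
    raw = (Path(maildir_root) / rel_path).read_bytes()
    return message_from_bytes(raw)
```

With the folders named above, read_index("/enron/indices/ldc_split/all") would give the labeled paths per category, and load_email("/enron/maildir", path) would return a parsed message whose headers can be read with, for example, msg["Subject"].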

