== Classification/Clustering Datasets ==

The latest way to get the datasets, including the data, the split indices, and the associated scripts, is through a special subversion repository:

 svn checkout http://nlp.cs.byu.edu/subversion/data/trunk .

This will check out everything except the actual data; that is, it retrieves the split indices and scripts from the repository. Each dataset is a directory organized as follows:

 dataset_root
  |- README
  |- indices
  |   |- ...
  |- scripts
      |- getData
      |- ...

The scripts/getData script is set up to copy the directories containing the corresponding data files into the dataset_root directory. After running the script, the dataset should be ready to use.

===Enron===

This is a descendant of the dataset harvested from Enron's mail server after it was used as evidence against the company and was subsequently put into the public domain. Originally, there were no topic labels available for the Enron dataset, so it was appropriate only for unsupervised methods. Recently, the LDC has released a set of annotations for the dataset, so it may now be used for classification, and external metrics can be used to evaluate clusterings of the annotated subset of the data.

=== Movie Review ===

This data set is derived from the data set provided by Bo Pang and Lillian Lee at Cornell. Only one split is provided, which supports both clustering and classification. The data set creators divided the reviews into three classes based on the normalized scores the authors gave in their reviews; in the indices provided here, these three classes have been labeled Positive, Negative, and Neutral. Details of the data-gathering process and the class-determination procedure can be found [http://www.cs.cornell.edu/people/pabo/movie%2Dreview%2Ddata/ here], particularly in [http://www.cs.cornell.edu/people/pabo/movie-review-data/scaledata.README.1.0.txt this README]. (A toy illustration of this kind of score bucketing appears at the end of this section.)

The data consists of movie reviews written by four authors:

* Dennis Schwartz - 1027 documents
* James Berardinelli - 1307 documents
* Scott Renshaw - 902 documents
* Steve Rhodes - 1770 documents

for a total of 5006 documents. The data has been divided into training, dev-test, and blind-test sets as follows:

====Training====
* Negative - 969
* Neutral - 1532
* Positive - 1504
* Total - 4005

====Dev Test====
* Negative - 121
* Neutral - 179
* Positive - 201
* Total - 501

====Blind Test====
* Negative - 107
* Neutral - 204
* Positive - 189
* Total - 500
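For concreteness, here is a minimal Python sketch of this kind of three-way bucketing of a normalized score. The threshold values are placeholders chosen purely for illustration; the authoritative cut-offs are the ones described in the scaledata README linked above.

 # Illustrative only: bucket a normalized review score in [0, 1] into the
 # three classes used by this split. LOW and HIGH are assumed placeholder
 # thresholds, not the cut-offs actually used by the dataset creators.
 LOW, HIGH = 0.4, 0.7
 
 def three_class_label(normalized_score):
     if normalized_score <= LOW:
         return 'Negative'
     if normalized_score < HIGH:
         return 'Neutral'
     return 'Positive'
 
 print(three_class_label(0.25))  # Negative
 print(three_class_label(0.55))  # Neutral
 print(three_class_label(0.90))  # Positive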
=== Reuters ===

=== Usenet ===

=== Social Bookmarking ===

There are two related but distinct datasets crawled from the del.icio.us bookmarking site. To use the old dataset, point to the del.icio.us/indices directory as the index to use. To use the new dataset, point to del.icio.us/new_indices.

====Old Social Bookmarking====

This data set was crawled from the popular social bookmarking site del.icio.us by Michael Goulding. Del.icio.us uses a tag-based system, where each bookmark can be assigned user-defined tags for organizational and sharing purposes. Dr. Ringger and Dan each chose 25 keywords, or topics, and the del.icio.us search facilities were used to find documents that had these topic labels as one of their tags. There were quite a few spam postings in the resulting data, so a few heuristics were applied to cull the "real" pages from the spam. The intention was to gather 50 documents from each of the 50 topics, but, after spam filtering, several topics ended up containing significantly fewer documents.

Two splits are provided for the social bookmarking data set: the full set (full_set), which contains 2307 documents, and a reduced set (tiny_set), which contains 396 documents. We would like to re-crawl this data set because:

* Documents were filtered to exclude any that don't contain the topic name as part of the text (to match the needs of Michael's keyword-extraction work at the time).
* Some topics have as few as 4 members, and we would like to have at least 50 documents for each.
* While most of the spam problems were solved by heuristic filtering, the data still includes documents that aren't "content" pages, such as pages that are forms or that consist mostly of JavaScript code.

{| border="1"
|+ Topic labels in the Old Social Bookmarking dataset
! Topic Label !! Document Count
|-
! align="left" | ajax
| 50
|-
! align="left" | algorithm
| 34
|-
! align="left" | applescript
| 50
|-
! align="left" | biblical archaeology
| 39
|-
! align="left" | byu
| 41
|-
! align="left" | clustering
| 50
|-
! align="left" | copyright
| 50
|-
! align="left" | dell computers
| 50
|-
! align="left" | diebold
| 50
|-
! align="left" | discriminative training
| 4
|-
! align="left" | final fantasy
| 50
|-
! align="left" | fuel cell
| 50
|-
! align="left" | games
| 41
|-
! align="left" | gardening
| 50
|-
! align="left" | google
| 50
|-
! align="left" | gtd
| 50
|-
! align="left" | hezbollah
| 50
|-
! align="left" | home theater
| 50
|-
! align="left" | howto
| 50
|-
! align="left" | ipod
| 50
|-
! align="left" | java
| 39
|-
! align="left" | kohler
| 50
|-
! align="left" | language identification
| 44
|-
! align="left" | lawncare
| 50
|-
! align="left" | machine learning
| 44
|-
! align="left" | mac
| 50
|-
! align="left" | mitt romney
| 50
|-
! align="left" | mosaic tile
| 50
|-
! align="left" | nanotechnology
| 50
|-
! align="left" | natural language processing
| 37
|-
! align="left" | news aggregator
| 42
|-
! align="left" | osx
| 34
|-
! align="left" | patents
| 50
|-
! align="left" | pedometer
| 32
|-
! align="left" | photography
| 50
|-
! align="left" | podcasts
| 50
|-
! align="left" | power supplies
| 50
|-
! align="left" | productivity
| 50
|-
! align="left" | programming
| 39
|-
! align="left" | psp
| 50
|-
! align="left" | riaa
| 50
|-
! align="left" | ruby
| 41
|-
! align="left" | sco
| 50
|-
! align="left" | security
| 50
|-
! align="left" | sprinklers
| 50
|-
! align="left" | text mining
| 50
|-
! align="left" | thai recipes
| 50
|-
! align="left" | translation
| 46
|-
! align="left" | wii
| 50
|-
! align="left" | youtube
| 50
|-
! align="left" | Total
| 2307
|}

====New Social Bookmarking====

This dataset was also crawled from del.icio.us and corrects errors made when collecting the first one. For example, the old social bookmarking data was collected using the del.icio.us search engine. This yielded pertinent results but missed the point of leveraging del.icio.us users' manual tagging of the bookmarked websites: instead of retrieving pages that were tagged with the topic labels, the old data set consisted of any pages that mentioned the topic words in the title or page description. This is most likely at least part of the reason the original dataset needed so much spam filtering.

The new dataset was crawled using the Ruby script found in the scripts directory of the del.icio.us dataset root directory. The keyphrases used are found in the file keyphrases.txt. Some of the topic labels consist of multiple words, which was handled in one of two ways:

# Concatenating the words into a single word - every document in such a category was tagged by at least one del.icio.us user with the concatenated string.
# Delimiting the words with the '+' character - every document in such a category was tagged with each of the individual words in the keyphrase by at least one user. This has the effect of taking the intersection of the sets of documents tagged with each word in the keyphrase (see the sketch following this list).

Each method was first tried manually in order to determine which was more appropriate for a particular keyphrase; sometimes one or the other would not produce enough documents to reach the quota of 100 documents per topic.
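As a small illustration of the two strategies (in Python, rather than the Ruby used by the actual crawl script), the sketch below assumes a docs_by_tag mapping from a tag to the set of documents carrying that tag. This mapping is a stand-in for whatever the crawler obtained from del.icio.us and is not part of the distributed scripts.

 # Hypothetical sketch of the two multi-word keyphrase strategies.
 # docs_by_tag maps a del.icio.us tag to the set of document ids tagged with it;
 # it stands in for data the crawler would have collected.
 def documents_for_keyphrase(keyphrase, docs_by_tag):
     if '+' in keyphrase:
         # '+'-delimited: intersect the document sets of the individual words.
         word_sets = [docs_by_tag.get(word, set()) for word in keyphrase.split('+')]
         return set.intersection(*word_sets)
     # Concatenated (or single-word) keyphrase: look the tag up directly.
     return docs_by_tag.get(keyphrase, set())
 
 # Toy data, for illustration only:
 docs_by_tag = {'mosaic': {1, 2, 3}, 'tile': {2, 3, 4}, 'fuelcell': {5, 6}}
 print(documents_for_keyphrase('mosaic+tile', docs_by_tag))  # {2, 3}
 print(documents_for_keyphrase('fuelcell', docs_by_tag))     # {5, 6}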
The file del.icio.us/new_data/urls.txt contains an index that maps each document file to the URL from which the document was collected.

'''NOTE:''' One of the files in the dataset (del.icio.us/new_data/applescript/doc01385.html) contains snippets of code from the ILOVEYOU VBS virus and might be flagged as malware by some signature-based anti-virus engines. However, the file does not contain enough of the code to run and is quite safe.

{| border="1"
|+ Topic labels in the New Social Bookmarking dataset
! Topic Label !! Document Count
|-
! align="left" | ajax
| 500
|-
! align="left" | algorithms
| 499
|-
! align="left" | applescript
| 500
|-
! align="left" | BYU
| 398
|-
! align="left" | clustering
| 500
|-
! align="left" | copyright
| 500
|-
! align="left" | Dell+Computers
| 380
|-
! align="left" | diebold
| 500
|-
! align="left" | finalfantasy
| 500
|-
! align="left" | fuelcell
| 500
|-
! align="left" | games
| 500
|-
! align="left" | gardening
| 500
|-
! align="left" | google
| 499
|-
! align="left" | gtd
| 500
|-
! align="left" | Hezbollah
| 500
|-
! align="left" | hometheater
| 500
|-
! align="left" | howto
| 500
|-
! align="left" | ipod
| 500
|-
! align="left" | java
| 500
|-
! align="left" | Kohler
| 84
|-
! align="left" | lawncare
| 238
|-
! align="left" | linux
| 500
|-
! align="left" | mac
| 500
|-
! align="left" | machinelearning
| 500
|-
! align="left" | MittRomney
| 273
|-
! align="left" | mosaic+tile
| 69
|-
! align="left" | nanotechnology
| 500
|-
! align="left" | natural+language+processing
| 167
|-
! align="left" | news+aggregator
| 500
|-
! align="left" | osx
| 500
|-
! align="left" | patents
| 500
|-
! align="left" | pedometer
| 267
|-
! align="left" | photography
| 346
|-
! align="left" | podcasts
| 390
|-
! align="left" | power+supply
| 498
|-
! align="left" | productivity
| 500
|-
! align="left" | programming
| 500
|-
! align="left" | riaa
| 500
|-
! align="left" | ruby
| 451
|-
! align="left" | SCO
| 500
|-
! align="left" | security
| 500
|-
! align="left" | sony
| 403
|-
! align="left" | sprinkler
| 189
|-
! align="left" | textmining
| 500
|-
! align="left" | thai+recipes
| 500
|-
! align="left" | translation
| 500
|-
! align="left" | wii
| 500
|-
! align="left" | windows
| 500
|-
! align="left" | youtube
| 478
|-
! align="left" | Total
| 21627
|}

=== 20 Newsgroups ===

This is the venerable 20 Newsgroups dataset, used by Thorsten Joachims in his 1997 ICML paper. It has been split according to the description of Joachims' split in that paper. There are currently four splits of this dataset:

{| border="1"
|+ A summary of the characteristics of the various splits of the 20 Newsgroups dataset.
! Split !! Class Count !! Document Count !! Clustering !! Classification
|-
! align="left" | full_set
| 20 || 19997 || Yes || Yes
|-
! align="left" | reduced_set
| 10 || 6000 || Yes || Yes
|-
! align="left" | tiny_set
| 4 || 400 || Yes || No
|-
! align="left" | indices_broad_reduction_5000
| 20 || 4999 || Yes || Yes
|}

A more complete description of each of these splits follows.
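The ''Percentage of Split'' columns in the tables below are simply each component's share of the split's total document count, as in this quick Python sketch using the full_set counts:

 # Recompute the percentage column of a split table from its document counts
 # (the full_set split of 20 Newsgroups, using the counts from the table below).
 counts = {'training': 13398, 'dev test': 3300, 'blind test': 3299}
 total = sum(counts.values())  # 19997
 for component, n in counts.items():
     print(component, round(100.0 * n / total, 2))
 # training 67.0, dev test 16.5, blind test 16.5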
align="left" | full_set | 20 || 19997 || Yes || Yes |- !align="left" | reduced_set | 10 || 6000 || Yes || Yes |- !align="left" | tiny_set | 4 || 400 || Yes || No |- !align="left" | indices_broad_reduction_5000 | 20 || 4999 || Yes || Yes |} A more complete description of each of these splits follows. ====full_set==== {| border="1" |+ The composition of the full_set split of the 20 Newsgroups dataset ! Component !! Document Count !! Percentage of Split |- ! align="left" | training |13398 || 67.00 |- ! align="left" | dev test |3300 || 16.50 |- ! align="left" | blind test |3299 || 16.50 |- ! align="left" | all |19997 || 100 |} ====reduced_set==== {| border="1" |+ The composition of the reduced_set split of the 20 Newsgroups dataset ! Component !! Document Count !! Percentage of Split |- ! align="left" | training |4000 || 66.667 |- ! align="left" | dev test |1000 || 16.667 |- ! align="left" | blind test |1000 || 16.667 |- ! align="left" | all |6000 || 100 |} ====tiny_set==== The tiny_set was made exclusively for testing clustering algorithms on a very small dataset. This split is not suitable for classification purposes. {| border="1" |+ The composition of the tiny_set split of the 20 Newsgroups dataset ! Component !! Document Count !! Percentage of Split |- ! align="left" | all |400 || 100 |} ====indices_broad_reduction_5000==== This split was created as a reduced set with representation from a larger number of classes. {| border="1" |+ The composition of the indices_broad_reduction_5000 split of the 20 Newsgroups dataset ! Component !! Document Count !! Percentage of Split |- ! align="left" | training |3359 || 67.19 |- ! align="left" | dev test |819 || 16.38 |- ! align="left" | blind test |821 || 16.42 |- ! align="left" | all |4999 || 100 |} === Book of Mormon === This data set was created by Dan Walker and contains a single split suitable only for clustering, as no natural labels have been applied. The documents in this data set are individual verses extracted from the version of the Book of Mormon available from [http://www.gutenberg.org Project Gutenberg]. The documents have been pre-processed, so that individual tokens are separated by whitespace. Tokens include words and punctuation characters.