Evaluation Data for Supervised Classifiers

This page provides some of the data used in the experiments for our ACL 2010 paper on creating robust supervised classifiers. We include the original set of 100 documents downloaded from the Project Gutenberg website, as well as the adjective-ordering and spelling examples derived from this corpus. For the Medline data, the original documents can be accessed at the Medline FTP site, and the derived adjective and spelling examples are provided below. More details are given in the paper.


All files have been archived and compressed using "tar -czvf". To extract an archive, run "tar -xzvf"; for example, "tar -xzvf GutenbergSept2009.documents.tgz" creates a directory containing the Gutenberg documents.
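If tar is not available on the command line, the same extraction can be done from Python. The following is a minimal sketch using the standard tarfile module; the function name extract_archive and the destination-directory argument are illustrative choices, not part of the data release.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_archive(archive_path: str, dest_dir: str = ".") -> List[str]:
    """Extract a gzip-compressed tar archive and return its member names.

    Roughly equivalent to running "tar -xzvf archive_path" inside dest_dir.
    """
    # Create the destination directory if it does not exist yet.
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=dest_dir)
    return names
```

For example, extract_archive("GutenbergSept2009.documents.tgz") would unpack the Gutenberg documents into the current directory, assuming the archive has been downloaded there.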

[Project Gutenberg Corpus]
100 Gutenberg documents.

[Gutenberg adjectives]
Adjective pairs extracted from Gutenberg documents.

[Gutenberg spellings]
Spelling examples extracted from Gutenberg documents.

[Medline adjectives]
Adjective pairs extracted from Medline documents.

[Medline spellings]
Spelling examples extracted from Medline documents.


If you use this data in your work, please cite our ACL 2010 paper.

Adjective Examples

The adjective examples are comma-separated lines, each containing the label and the two adjectives (in alphabetical order). When the label is 1, the alphabetical order is correct. When the label is 0, the reverse order is correct:

1,higher,specific
0,large,numerous

That is, higher precedes specific but numerous precedes large in the source text.
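The format described above can be read with a few lines of Python. This is only a sketch: the names AdjectiveExample, parse_adjective_line, and observed_order are hypothetical, and the two sample lines are inferred from the description above rather than copied from the data files.

```python
from typing import NamedTuple, Tuple


class AdjectiveExample(NamedTuple):
    label: int   # 1: alphabetical order is correct, 0: reverse order is correct
    first: str   # alphabetically first adjective
    second: str  # alphabetically second adjective


def parse_adjective_line(line: str) -> AdjectiveExample:
    """Parse one "label,adjective1,adjective2" line."""
    label, first, second = line.strip().split(",")
    if label not in ("0", "1"):
        raise ValueError(f"unexpected label: {label!r}")
    return AdjectiveExample(int(label), first, second)


def observed_order(example: AdjectiveExample) -> Tuple[str, str]:
    """Return the adjectives in the order they appeared in the source text."""
    if example.label == 1:
        return (example.first, example.second)
    return (example.second, example.first)


# Sample lines assumed from the description, not taken from the released files.
for sample in ["1,higher,specific", "0,large,numerous"]:
    print(observed_order(parse_adjective_line(sample)))
```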

Spelling Examples

For the spelling examples, there are two fields separated by a tab character: the index and the entity+context. The entity+context is a space-separated list of tokens. The index identifies the position of the entity in the context. E.g.:

4     as to the best site for Michelangelo 's gigantic

The instance at position 4 (counting from 0) is the word site. The task is to predict, from the context, which member of the confusion set {site, sight, cite} is most likely at that position.
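A spelling example can be split into its target word and surrounding context as sketched below. The function names are hypothetical, and the sketch assumes the index counts tokens from 0, which is consistent with the example above.

```python
from typing import List, Tuple


def parse_spelling_line(line: str) -> Tuple[int, List[str]]:
    """Parse one "index<TAB>entity+context" line into (index, tokens)."""
    index_field, context = line.rstrip("\n").split("\t", 1)
    index = int(index_field)
    tokens = context.split()  # the context is a space-separated token list
    if not 0 <= index < len(tokens):
        raise ValueError(f"index {index} is outside the context")
    return index, tokens


def split_context(index: int, tokens: List[str]) -> Tuple[str, List[str], List[str]]:
    """Return the target word plus the tokens to its left and right."""
    return tokens[index], tokens[:index], tokens[index + 1:]


# The example from this page, with a real tab between the two fields.
index, tokens = parse_spelling_line("4\tas to the best site for Michelangelo 's gigantic")
target, left, right = split_context(index, tokens)
print(target, left, right)
```

A classifier would then see the left and right context and choose among the members of the confusion set, with the target word serving as the gold label.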

Please send an e-mail to sbergsma@jhu.edu if you use the data. We'd also be happy to help if you need any assistance.

Shane Bergsma
May 4, 2010