Training and Testing Data for Anaphora Resolution

If you use this data in your work, please cite as:


We labelled third-person pronoun-antecedent pairs in 118 documents from the Slate section of the American National Corpus. There are 1398 labelled pronouns in 79 documents in the training set and 1381 labelled pronouns in 41 documents in the test set. Most of the Slate documents are "gist" articles, which provide factual background information for stories currently in the news. Our system uses only pronouns that refer to noun phrases given previously in the text. We therefore label, and then ignore, three kinds of pronoun: those referring to implicit entities not specifically mentioned, cataphora (e.g., "After he was elected, President Clinton..."), and pleonastic pronouns without an antecedent (e.g., "it is raining"). Of the 2779 labelled pronouns, 219 fall into these ignored categories.

We are pleased to share our anaphora resolution labels, but unfortunately, we are not permitted to share the original text articles from the American National Corpus. The people at the ANC have made it possible for other researchers to reconstruct the coreference-labelled data sets once they acquire the original ANC files. They converted our tagged data to stand-off annotations which can be merged with the original documents to re-tag the files. The next release of the American National Corpus will have facilities to easily merge these stand-off annotations into the original documents.

Update: The second release is now available, and the instructions for installing the annotations are available here.

For those who already have the first release, there is a simple way to merge the current annotations into your documents. Every occurrence of a third-person pronoun in the given articles is labelled, in order, in the annotation files. You can therefore merge the annotations with the documents by inserting the next label in the list each time you encounter a pronoun in the corresponding ANC file. For your convenience, we list the labelled pronouns below:
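The in-order merge described above can be sketched in Python as follows. This is only an illustration under several assumptions: the pronoun inventory, the simple regular-expression tokenizer, the function name merge_labels, and the inline word[label] output format are all hypothetical and do not reflect the actual format of the annotation files or the ANC documents. It also assumes plain text input, so real ANC files may need their markup handled first.

```python
import re
from typing import List

# Assumed inventory of third-person pronouns (hypothetical; the list used
# for the actual labelling may differ).
THIRD_PERSON_PRONOUNS = {
    "he", "him", "his", "himself",
    "she", "her", "hers", "herself",
    "it", "its", "itself",
    "they", "them", "their", "theirs", "themselves",
}

# Very simple tokenizer: a word is a maximal run of ASCII letters.
WORD_RE = re.compile(r"[A-Za-z]+")


def merge_labels(text: str, labels: List[str]) -> str:
    """Attach labels[i] to the i-th third-person pronoun found in text.

    The output format word[label] is a placeholder chosen for this sketch.
    """
    remaining = iter(labels)
    used = 0

    def attach(match: "re.Match[str]") -> str:
        nonlocal used
        word = match.group(0)
        if word.lower() not in THIRD_PERSON_PRONOUNS:
            return word  # not a pronoun: leave the word unchanged
        try:
            label = next(remaining)  # consume the next label, in order
        except StopIteration:
            raise ValueError("text contains more pronouns than labels") from None
        used += 1
        return f"{word}[{label}]"

    merged = WORD_RE.sub(attach, text)
    if used != len(labels):
        # Leftover labels mean the pronoun list and the text are out of sync.
        raise ValueError("text contains fewer pronouns than labels")
    return merged


if __name__ == "__main__":
    # "L1" and "L2" stand in for whatever a real annotation entry contains.
    print(merge_labels("Clinton said he would sign it.", ["L1", "L2"]))
    # prints: Clinton said he[L1] would sign it[L2].
```

Because the merge relies purely on order, a single pronoun that is detected differently from the original labelling would shift every later label. The sketch therefore raises an error when the pronoun and label counts disagree rather than silently producing a misaligned file.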


Directory listings of the labelled files are available through the following links:

The training set
The test set

The filenames in these directories correspond to the filenames of the original ANC files. Each directory also contains a zip archive of all its stand-off coreference annotation files.

Information about acquiring the American National Corpus and other details are available at their website:

Thanks to Dekang Lin, Nancy Ide and Keith Suderman for their help in making this data available. If anyone needs any assistance, please e-mail me.