Training and Testing Data for Anaphora Resolution
If you use this data in your work, please cite as:
Automatic Acquisition of Gender Information for Anaphora Resolution,
In Canadian AI 2005,
(© Springer Verlag),
Victoria, British Columbia, May 9-11, 2005, pages 342-353.
We labelled third-person pronoun-antecedent
pairs in 118 documents from the Slate section of the American National
Corpus. There are 1398 labelled pronouns in 79 documents in the
training set and 1381 labelled pronouns in 41 documents in the test
set. Most of the Slate documents are "gist" articles which provide
factual background information for stories currently in the news.
Only pronouns that refer to noun phrases given previously in the text
are used in our system. Thus we label and ignore pronouns referring to
implicit entities not specifically mentioned, cataphora (e.g., "After
he was elected, President Clinton..."), and
pleonastic pronouns without an antecedent (e.g., "it
is raining"). Of the 2779 total pronouns labelled, 219 are so identified.
We are pleased to share our anaphora resolution labels, but unfortunately,
we are not permitted to share the original text articles from the
American National Corpus. The people at the ANC have made it possible for
other researchers to reconstruct the coreference-labelled data sets once
they acquire the original ANC files. They converted our tagged data to
stand-off annotations which can be merged with the original documents to
re-tag the files. The next release of the American National Corpus will
have facilities to easily merge these stand-off annotations into the original
documents.
Update: The second release is now available, and the instructions for
installing the annotations are available here.
Of course, for those who already have the first release, there is a simple
way to merge the current annotations into your documents. Since every
occurrence of a third-person pronominal in the given articles is labelled,
in order, in the annotation files, you can merge the annotations with the
documents by inserting the next label in the list each time you encounter
a pronoun in the corresponding ANC file.
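The in-order merge described above can be sketched roughly as follows. This is only an illustration, not the procedure used to build the data set: the pronoun inventory, the tokenization, the `merge_labels` name, and the `[pronoun|label]` output format are all assumptions, and the labels are treated as opaque strings read from an annotation file.

```python
import re

# Assumption: a fixed inventory of third-person pronoun forms. The inventory
# used for the actual annotation may differ.
THIRD_PERSON_PRONOUNS = {
    "he", "him", "his", "himself",
    "she", "her", "hers", "herself",
    "it", "its", "itself",
    "they", "them", "their", "theirs", "themselves",
}

# Split text into alternating runs of letters and non-letters so that joining
# the pieces reproduces the original text exactly.
TOKEN_RE = re.compile(r"[A-Za-z]+|[^A-Za-z]+")


def merge_labels(text, labels):
    """Attach the next label, in order, to each third-person pronoun in text.

    The bracketed output format is hypothetical; adapt it to whatever
    markup your copy of the corpus uses.
    """
    labels = list(labels)
    remaining = iter(labels)
    pieces = []
    used = 0
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if token.lower() in THIRD_PERSON_PRONOUNS:
            try:
                label = next(remaining)
            except StopIteration:
                raise ValueError("more pronouns in the text than labels")
            pieces.append(f"[{token}|{label}]")
            used += 1
        else:
            pieces.append(token)
    # A leftover label means the text and the annotation list are out of step.
    if used != len(labels):
        raise ValueError("more labels than pronouns in the text")
    return "".join(pieces)
```

For example, `merge_labels("Clinton said he would sign it.", ["A1", "P"])` returns `"Clinton said [he|A1] would sign [it|P]."`. Raising an error on a count mismatch is a deliberate choice: because the merge relies purely on ordering, a single missed or extra pronoun would silently shift every later label.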
For your convenience, we list the labelled pronouns below:
Directory listings of the labelled files are available through the following links:
The training set
The test set
The filenames in these directories correspond to the filenames of the original
ANC files. Note that each directory also contains a zipped archive of its
contents, which holds all the stand-off coreference annotation files.
Information about acquiring the American National Corpus and other details
are available at their website:
Thanks to Dekang Lin, Nancy Ide and Keith Suderman for their help in
making this data available. If anyone needs any assistance, please