It-Bank: an online repository for annotated instances of the pronoun "it"


[IT-BANK] Tar'd and gzipped: 170 KB compressed

If you use this data in your work, please cite as:

Please send us an e-mail if you use the data. We would also be happy to help if you need any assistance.

Also, please e-mail us if you would like to contribute any labelled data to It-Bank. We make available only sentences in which the word "it" occurs, and we randomize the order of the sentences. Fair-use guidelines allow excerpting sentences from published work for educational purposes. Contributors, however, retain the copyright on their annotations.

The Data

Each annotated sentence contains an instance of the English pronoun "it". Each instance is annotated as either referential, labelled "1", or non-referential, labelled "0". Please see the above paper for an explanation of referential vs. non-referential pronouns, as well as our annotation guidelines and inter-annotator agreement statistics. Each example is a tab-separated triple consisting of a label, a position, and a sentence:

1 1 It is not expected to cause him to miss any games .
1 10 More frequent droughts would make water even scarcer than it is today .
0 7 She adds , however , that it 's impossible to be certain the cells were motor neurons .

The sentence is a space-separated list of tokens corresponding to one sentence. All sentences were automatically tokenized and segmented using Dekang Lin's Language and Text Analysis Tools (D. Lin. 2001. LaTaT: Language and Text Analysis Tools. In Proceedings of Human Language Technology Conference 2001, pp. 222-227). Note that this automatic processing occasionally introduces errors in tokenization or segmentation.

The second component of the triple, the position, identifies which token in the sentence the label applies to. Positions are counted from 1, as in the examples above, and the position always points to an "it" token. The position number is needed because a sentence may contain more than one occurrence of the word "it".
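
For readers who want to load the data, the following is a minimal Python sketch of how one line might be parsed. It is not part of the It-Bank distribution: the names ItInstance and parse_line are hypothetical, and the code assumes that the fields are separated by tab characters and that positions are 1-based, as the examples above suggest. Treating a position that does not point to an "it" token as an error is also our assumption here.

```python
from typing import List, NamedTuple


class ItInstance(NamedTuple):
    """One annotated example (hypothetical container, not part of It-Bank)."""
    label: int         # 1 = referential, 0 = non-referential
    position: int      # 1-based index of the labelled "it" token
    tokens: List[str]  # tokens of the sentence


def parse_line(line: str) -> ItInstance:
    """Parse one 'label<TAB>position<TAB>sentence' line."""
    # Split on the first two tabs only; the sentence is the remainder.
    label_str, position_str, sentence = line.rstrip("\n").split("\t", 2)
    label, position = int(label_str), int(position_str)
    tokens = sentence.split()  # the sentence is space-separated tokens

    if label not in (0, 1):
        raise ValueError(f"unexpected label: {label}")
    if not 1 <= position <= len(tokens):
        raise ValueError(f"position {position} is outside the sentence")
    # Assumption: compare case-insensitively, since "It" may start a sentence.
    if tokens[position - 1].lower() != "it":
        raise ValueError(f"token at position {position} is not 'it'")
    return ItInstance(label, position, tokens)
```

Applied to the third example above, parse_line returns a label of 0, a position of 7, and a token list whose seventh element is "it". To read a whole file, one would call parse_line on each line of the unpacked data file.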

Please let us know if you observe any obvious errors or inconsistencies in the labels. We will not fix errors in tokenization or segmentation.

Shane Bergsma
March 31, 2008