The overall performance of machine-learned NLP systems is often determined more by the size of the training data than by the learning algorithms themselves [Banko and Brill 2001]. The web is by far the largest available source of textual data. Previous research using the web as a corpus has mostly relied on search engines to obtain the frequency counts and/or contexts of given phrases [Lapata and Keller 2005]. Unfortunately, this approach is hopelessly inefficient for building large-scale lexical resources.
We propose to build a system for acquiring lexical knowledge from ngram counts of web data. Because multiple occurrences of the same string are collapsed into a single entry, the ngram data is considerably smaller than the original text. Moreover, since most lexical learning algorithms collect statistics only from small windows of text anyway, ngram data can supply the statistics these tasks need in a far more compact and efficient fashion. Ngram counts may appear to be a rather impoverished data source; however, a surprisingly large variety of knowledge can be mined from them. For example, consider the referents of the pronoun ‘his’ in the following sentences:
(1) John needed his friends
(2) John needed his support
(3) John offered his support
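Before turning to what these examples show, the compactness point above can be sketched in a few lines: repeated strings collapse into single table entries with counts, while the small-window statistics that lexical learners need are preserved. The toy corpus below is invented for illustration; a real system would run this over web-scale text.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Collapse a token stream into a table of n-gram counts:
    each repeated string becomes one entry, so the table is far
    smaller than the text it summarizes."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus (invented for illustration).
tokens = "the cat sat on the mat and the cat ran".split()
bigrams = ngram_counts(tokens, 2)
print(bigrams[("the", "cat")])  # 2: occurs twice, stored once with a count
```

Here the 9 bigram tokens of the corpus are stored as 8 distinct entries; on web-scale text the compression is, of course, far more dramatic.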
The fact that (1) and (3) have a different coreference relationship than (2) seems to hinge on a piece of ‘world knowledge’: one never needs one’s own support, since one already has it. [Bergsma and Lin 2006] showed that such seemingly ‘deep’ world knowledge can in fact be obtained from shallow POS-tagged ngram statistics.
Satoshi Sekine, New York University
Shane Bergsma, University of Alberta
Kapil Dalwani, Johns Hopkins University
Sushant Narsale, Johns Hopkins University
Emily Pitler, University of Pennsylvania
Rachel Lathbury, University of Virginia
Vikram Rao, Cornell University