Unsupervised Acquisition of Lexical Knowledge from N-Grams

The overall performance of machine-learned NLP systems is often determined more by the size of the training data than by the learning algorithms themselves [Banko and Brill, 2001]. The web is undoubtedly the largest textual data set available. Previous work that treats the web as a corpus has mostly relied on search engines to obtain frequency counts and/or contexts for given phrases [Lapata and Keller, 2005]. Unfortunately, this approach is far too inefficient for building large-scale lexical resources.

We propose to build a system for acquiring lexical knowledge from n-gram counts of web data. Because multiple occurrences of the same string are collapsed into a single entry with a count, the n-gram data is considerably smaller than the original text. Moreover, since most lexical learning algorithms collect evidence only from small windows of text, the n-gram data can supply the statistics those tasks need in a far more compact and efficient form. N-gram counts may appear to be a rather impoverished data source, but a surprisingly large variety of knowledge can be mined from them. For example, consider the referents of the pronoun ‘his’ in the following sentences:

(1) John needed his friends
(2) John needed his support
(3) John offered his support

The fact that (1) and (3) have a different coreference relationship from (2) seems to hinge on a piece of ‘world knowledge’: one never needs one’s own support, since one already has it. [Bergsma and Lin, 2006] showed that such seemingly ‘deep’ world knowledge can in fact be obtained from shallow statistics over POS-tagged n-grams.
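To make the kind of statistic involved concrete, the sketch below estimates whether the subject of a verb is likely to corefer with a possessive pronoun in its object, using nothing but n-gram counts. It probes first-person patterns such as ‘I needed my support’ versus ‘I needed his support’: if the possessive rarely matches the subject, the pattern suggests non-coreference, as in (2). This is only an illustration of the idea, not the project’s actual feature set; the count-file format, function names, and the first-person probe are assumptions made for the example.

from collections import defaultdict

# When the subject is "I", a coreferent possessive must be "my"; any other
# possessive pronoun signals that the possessor is someone else.
MATCH_POSS = ("my",)
MISMATCH_POSS = ("his", "her", "their", "your")

def load_ngram_counts(path):
    # Read "n-gram<TAB>count" lines into a dict (hypothetical file format).
    counts = defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            counts[ngram] += int(count)
    return counts

def possession_ratio(counts, verb, noun):
    # Fraction of "I <verb> <poss> <noun>" 4-grams whose possessive matches
    # the subject.  Near 0: people rarely <verb> their own <noun> (e.g.
    # "support"), so the possessive in "John <verb> his <noun>" likely refers
    # to someone other than John.  Near 1: coreference is plausible.
    match = sum(counts["I %s %s %s" % (verb, p, noun)] for p in MATCH_POSS)
    mismatch = sum(counts["I %s %s %s" % (verb, p, noun)] for p in MISMATCH_POSS)
    total = match + mismatch
    return match / total if total else None

if __name__ == "__main__":
    counts = load_ngram_counts("4grams.txt")  # hypothetical local count file
    for verb, noun in [("needed", "friends"), ("needed", "support"), ("offered", "support")]:
        print(verb, noun, possession_ratio(counts, verb, noun))

With web-scale counts, the same ratio can be computed for arbitrary verb-noun pairs, which is the kind of statistic that makes the n-gram data a practical substitute for issuing search-engine queries.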

Final Report

 

Team Members

Senior Members
Ken Church, Microsoft
Heng Ji, CUNY
Dekang Lin, Google
Satoshi Sekine, New York University

Graduate Students
Kailash Patil, CLSP
Shane Bergsma, University of Alberta
Kapil Dalwani, Johns Hopkins University
Sushant Narsale, Johns Hopkins University
Emily Pitler, University of Pennsylvania

Undergraduate Students
Rachel Lathbury, University of Virginia
Vikram Rao, Cornell University
