Unsupervised Acquisition of Lexical Knowledge from N-Grams

Research Group of the 2009 Summer Workshop

The performance of machine-learned NLP systems is often determined more by the size of the training data than by the learning algorithms themselves [Banko and Brill 2001]. The web is arguably the largest textual data set available. Previous research that uses the web as a corpus has mostly relied on search engines to obtain the frequency counts and/or contexts of given phrases [Lapata and Keller 2005]. Unfortunately, this approach is far too inefficient for building large-scale lexical resources.

We propose to build a system for acquiring lexical knowledge from n-gram counts of web data. Because multiple occurrences of the same string are collapsed into a single entry, the n-gram data is considerably smaller than the original text. And because most lexical learning algorithms collect data only from small windows of text anyway, the n-gram data can supply the statistics needed for the learning tasks in a much more compact and efficient form. N-gram counts may appear to be a rather impoverished data source, yet a surprisingly large variety of knowledge can be mined from them. For example, consider the referents of the pronoun ‘his’ in the following sentences:

(1) John needed his friends
(2) John needed his support
(3) John offered his support

In (1) and (3), ‘his’ can refer to John, whereas in (2) it cannot. This difference in coreference seems to hinge on a piece of ‘world knowledge’: one never needs one’s own support, since one already has it. [Bergsma and Lin 2006] showed that such seemingly ‘deep’ world knowledge can actually be obtained from shallow POS-tagged n-gram statistics.
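The idea of collapsing repeated strings into counts, and then querying those counts over small windows, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy corpus is invented rather than drawn from web data, the helper names `ngram_counts` and `pattern_count` are hypothetical, and the wildcard lookup is a generic technique, not the method of Bergsma and Lin or the system proposed above.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Collapse every n-token window into one (n-gram -> count) entry."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def pattern_count(counts, pattern):
    """Sum the counts of n-grams matching a pattern; '*' matches any one token."""
    return sum(
        count
        for gram, count in counts.items()
        if len(gram) == len(pattern)
        and all(p == "*" or p == tok for p, tok in zip(pattern, gram))
    )


# Toy corpus, invented for illustration only (not real web data).
corpus = [
    "John needed his support",
    "Mary needed his support",
    "John offered his support",
    "John offered his support",
    "John needed his friends",
]

trigrams = ngram_counts(corpus, 3)

# Ten trigram windows in the text collapse to six distinct entries.
total_windows = sum(trigrams.values())
distinct_entries = len(trigrams)

# Statistics over a small window can be read off the counts directly,
# without going back to the original text.
needed_his_any = pattern_count(trigrams, ("needed", "his", "*"))
any_his_support = pattern_count(trigrams, ("*", "his", "support"))
```

On real web-scale data the saving would come from the same effect at much larger scale: frequent strings are stored once with a count, and a lookup such as `pattern_count` touches only the count table.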

Final Report

Team Members
Senior Members
Ken Church	Microsoft
Heng Ji	CUNY
Dekang Lin	Google
Satoshi Sekine	New York University
Graduate Students
Kailash Patil	CLSP
Shane Bergsma	University of Alberta
Kapil Dalwani	Johns Hopkins University
Sushant Narsale	Johns Hopkins University
Emily Pitler	University of Pennsylvania
Undergraduate Students
Rachel Lathbury	University of Virginia
Vikram Rao	Cornell University

Center for Language and Speech Processing