The Center for Language and Speech Processing




About CLSP
Upcoming Seminar

Bill Byrne
November 24th
4:30PM
CSEB Room B17
"Hierarchical Phrase-based Translation with Weighted Finite State Transducers"


Workshops

Unsupervised Acquisition of Lexical Knowledge from N-Grams

The performance of machine-learned NLP systems is often determined more by the size of the training data than by the learning algorithms themselves [Banko and Brill 2001]. The web is arguably the largest textual data set available. Previous research that uses the web as a corpus has mostly relied on search engines to obtain frequency counts and/or contexts for given phrases [Lapata and Keller 2005]. Unfortunately, this approach is far too inefficient for building large-scale lexical resources.

We propose to build a system for acquiring lexical knowledge from n-gram counts of web data. Because multiple occurrences of the same string are collapsed into a single entry with a count, the n-gram data is considerably smaller than the original text. And because most lexical learning algorithms collect data only from small windows of text anyway, the n-gram data can provide the statistics these learning tasks need in a much more compact and efficient form. N-gram counts may appear to be a rather impoverished data source, yet a surprisingly large variety of knowledge can be mined from them. For example, consider the referents of the pronoun 'his' in the following sentences:

  1. John needed his friends
  2. John needed his support
  3. John offered his support

In (1) and (3), 'his' can refer to John, whereas in (2) it most naturally refers to someone else. This difference in coreference seems to hinge on a piece of 'world knowledge': one never needs one's own support, since one already has it. [Bergsma and Lin 2006] showed that such seemingly 'deep' world knowledge can actually be obtained from shallow POS-tagged n-gram statistics.
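As a rough illustration of how n-gram counts might expose this kind of knowledge, the Python sketch below collapses a tiny invented corpus into 4-gram counts. For a given verb and noun, it then measures how often the subject pronoun and the possessive pronoun agree. A low agreement rate for "needed ... support" would hint that the possessive usually refers to someone other than the subject. The corpus, the `AGREEING` table, and the function names are all assumptions made for this example. It is a toy heuristic and does not reproduce the method of the cited work.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Collapse every n-token window into a single entry with a count."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical table pairing subject pronouns with agreeing possessives.
AGREEING = {"he": "his", "she": "her", "they": "their"}


def agreement_rate(counts, verb, noun):
    """Fraction of '<subject> verb <possessive> noun' 4-grams whose
    two pronouns agree; None if the pattern never occurs."""
    agree = total = 0
    for (subj, v, poss, obj), c in counts.items():
        if v != verb or obj != noun:
            continue
        if subj not in AGREEING or poss not in AGREEING.values():
            continue
        total += c
        if AGREEING[subj] == poss:
            agree += c
    return agree / total if total else None


# Toy corpus, invented for illustration only (not real web data).
toy_corpus = [
    "he needed her support",
    "she needed his support",
    "they needed her support",
    "he offered his support",
    "she offered her support",
    "he offered his support",
    "they offered her support",
]

counts = ngram_counts(toy_corpus, 4)
needed_rate = agreement_rate(counts, "needed", "support")
offered_rate = agreement_rate(counts, "offered", "support")
print(needed_rate, offered_rate)
```

Note that the repeated sentence "he offered his support" is stored once with a count of 2, which is the compaction described above. On real web-scale counts the patterns would be far noisier, and POS tags would be needed to pick out verbs and nouns reliably.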

Team Members

Senior Members

Graduate Students

Undergraduate Students