"Repetition, Adaptation and Language Modeling"

Kenneth Ward Church
Department Head, AT&T Labs-Research
Florham Park, NJ, USA.

Repetition is very common. Adaptive language models were introduced to account for the fact that words (and their variant forms) tend to appear in bursts. We will show that this is especially true for content-rich words such as proper nouns, technical terminology and good keywords for information retrieval. A proper noun like "Kennedy" is more likely to be repeated within a Brown Corpus document than a word like "showed," even though the two words are about equally frequent. We find that words (and ngrams) with more content tend to be more bursty than words (and ngrams) with less content, all other things being equal. Two measures borrowed from Information Retrieval, term frequency and document frequency, will be used to predict both the average frequency and the variance (burstiness) of a word. The literature on adaptive language models has studied the first moment in considerable detail, but has tended to ignore the second moment.
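To make the contrast between frequency and burstiness concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used in the talk: the four-document toy corpus is invented, the function name burstiness_stats is hypothetical, and the ratio of term frequency to document frequency is only one simple indicator of how strongly a word clusters within documents.

```python
from collections import Counter
from statistics import mean, pvariance


def burstiness_stats(docs, word):
    """Return simple frequency and burstiness statistics for one word.

    docs is a list of tokenized documents (lists of strings).
    This helper is hypothetical and exists only for illustration.
    """
    # Occurrences of the word in each document (0 if absent).
    counts = [Counter(doc)[word] for doc in docs]
    tf = sum(counts)                      # term frequency: total occurrences
    df = sum(1 for c in counts if c > 0)  # document frequency: docs containing the word
    return {
        "tf": tf,
        "df": df,
        "mean": mean(counts),             # average count per document
        "variance": pvariance(counts),    # spread of per-document counts
        "tf_per_df": tf / df if df else 0.0,  # occurrences per document that has the word
    }


# Invented toy corpus (an assumption, not Brown Corpus data): both words
# occur three times overall, but "kennedy" is packed into a single document.
docs = [
    "kennedy said kennedy would meet kennedy aides".split(),
    "the report showed a gain".split(),
    "the survey showed a loss".split(),
    "the study showed no change".split(),
]

for w in ("kennedy", "showed"):
    print(w, burstiness_stats(docs, w))
```

In this toy data both words have the same term frequency and the same mean count per document, yet "kennedy" has a document frequency of 1 against 3 for "showed," and a much larger variance. Two words can match on the first moment and still differ sharply on the second.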