Repetition and Language Models and Comparable Corpora – Ken Church (Johns Hopkins University)

When:
September 29, 2009 all-day

Abstract
I will discuss a couple of non-standard features that I believe could be useful for working with comparable corpora. Dotplots have been used in biology to find interesting DNA sequences. Biology is interested in ordered matches, which show up as (possibly broken) diagonals in dotplots. Information Retrieval is more interested in unordered matches (e.g., cosine similarity), which show up as squares in dotplots. Parallel corpora have both squares and diagonals multiplexed together: the diagonals tell us what is a translation of what, and the squares tell us what is in the same language.

There is also an opportunity to take advantage of repetition in comparable corpora. Repetition is very common. Standard bag-of-words models in Information Retrieval do not attempt to model discourse structure such as given/new. The first mention in a news article (e.g., “Manuel Noriega, former President of Panama”) is different from subsequent mentions (e.g., “Noriega”). Adaptive language models were introduced in Speech Recognition to capture the fact that probabilities change, or adapt: after we see the first mention, we should expect a subsequent mention. If the first mention has probability p, then under standard (bag-of-words) independence assumptions, two mentions ought to have probability p^2, but we find the probability is actually closer to p/2.

Adaptation matters more for meaningful units of text. In Japanese, words (meaningful sequences of characters) are more likely to be repeated than fragments (meaningless sequences of characters from words that happen to be adjacent). In newswire, we find more adaptation for content words (proper nouns, technical terminology, out-of-vocabulary (OOV) words, and good keywords for information retrieval), and less adaptation for function words, cliches, and ordinary first names. There is more to meaning than frequency: content words are not only low frequency, but also likely to be repeated.
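To make the squares-versus-diagonals distinction concrete, here is a minimal sketch (not code from the talk) of a word-level dotplot: a cell (i, j) is marked when token i of one text equals token j of the other. Ordered matches (shared word order, as in aligned translations) show up as diagonal runs, while unordered matches (shared vocabulary) show up as dense square regions. All function names and the toy texts are invented for illustration.

```python
def dotplot(tokens_a, tokens_b):
    """Return the set of matching (i, j) coordinates between two token lists."""
    positions_b = {}
    for j, tok in enumerate(tokens_b):
        positions_b.setdefault(tok, []).append(j)
    points = set()
    for i, tok in enumerate(tokens_a):
        for j in positions_b.get(tok, []):
            points.add((i, j))
    return points


def render(points, rows, cols):
    """Print a tiny ASCII dotplot; '#' marks a match."""
    for i in range(rows):
        print("".join("#" if (i, j) in points else "." for j in range(cols)))


if __name__ == "__main__":
    a = "the cat sat on the mat".split()
    b = "the cat sat on the mat again".split()
    render(dotplot(a, b), len(a), len(b))
```

On real corpora one would plot millions of points rather than print ASCII, but the geometry is the same: translations contribute diagonals, same-language repetition contributes squares.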
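The adaptation effect can likewise be illustrated with a small sketch, again not the talk's own code: split each document into a "history" half and a "test" half, and compare the prior probability that a word appears in the test half with the probability that it appears there given that it already appeared in the history. Under bag-of-words independence the two would be roughly equal; adapted (bursty) words show a much larger conditional probability. The helper names and toy documents below are assumptions made for the example.

```python
def adaptation_ratio(documents, word):
    """Compare P(word in test half) with P(word in test half | word in history half)."""
    in_test = 0      # documents whose second half contains the word
    in_history = 0   # documents whose first half contains the word
    in_both = 0      # documents where it appears in both halves
    for doc in documents:
        tokens = doc.split()
        mid = len(tokens) // 2
        history, test = set(tokens[:mid]), set(tokens[mid:])
        if word in test:
            in_test += 1
        if word in history:
            in_history += 1
            if word in test:
                in_both += 1
    n = len(documents)
    prior = in_test / n if n else 0.0
    adapted = in_both / in_history if in_history else 0.0
    return prior, adapted


if __name__ == "__main__":
    docs = [
        "noriega former president of panama noriega was arrested",
        "the weather today is mild with light winds across the region",
        "noriega appeared in court today lawyers for noriega objected",
    ]
    prior, adapted = adaptation_ratio(docs, "noriega")
    print(f"prior = {prior:.2f}, adapted = {adapted:.2f}")
```

Run over a large newswire collection, this kind of measurement is what distinguishes highly adapted content words (proper nouns, terminology, OOV words) from weakly adapted function words.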
Biography
MIT undergrad (1978) and grad (1983), followed by 20 years at AT&T Bell Labs (1983-2003) and 6 years at Microsoft Research (2003-2009). Currently at Hopkins as Chief Scientist of the Human Language Technology Center of Excellence and Research Professor in Computer Science. Honors: AT&T Fellow. I have worked on many topics in computational linguistics, including web search, language modeling, text analysis, spelling correction, word-sense disambiguation, terminology, translation, lexicography, compression, speech (recognition and synthesis), and OCR, as well as applications that go well beyond computational linguistics, such as revenue assurance and virtual integration (using screen scraping and web crawling to integrate systems that traditionally don't talk to each other as well as they could, such as billing and customer care). When we were reviving empirical methods in the 1990s, we thought the AP News was big (1 million words per week), but since then I have had the opportunity to work with much larger data sets such as telephone call detail (1-10 billion records per month) and web logs (even bigger).

Center for Language and Speech Processing