Statistical Natural Language Processing

Eugene Charniak
Brown University

Over the last ten years or so, the field of natural language processing (NLP) has become increasingly dominated by corpus-based methods and statistical techniques. In this line of research, problems are attacked by collecting statistics from a corpus (sometimes marked with correct answers, sometimes not) and then applying those statistics to new instances of the task. In this talk we give an overview of statistical techniques in four areas of NLP: parsing (finding the correct phrase structure for a sentence), lexical semantics (learning meanings and other properties of words and phrases from text), anaphora resolution (determining the intended antecedents of pronouns, and of noun phrases in general), and word-sense disambiguation (finding the correct sense, in context, of a word with multiple meanings). As a general rule, corpus-based techniques, and statistical techniques in particular, outperform hand-crafted systems, and the rate of progress in the field is still quite high.