A Simple, Corpus-Based Method for Finding Base Noun Phrases

Claire Cardie, Cornell University

March 31, 1998


Abstract

Finding simple, non-recursive, base noun phrases is an important subtask for many natural language processing applications. While previous empirical methods for base NP identification have been rather complex, this talk instead proposes a very simple algorithm that is tailored to the relative simplicity of the task. In particular, the talk will present a corpus-based approach for finding base NPs by matching part-of-speech tag sequences. The training phase of the algorithm is based on two successful techniques: first, the base NP grammar is read from a "treebank" corpus (à la Charniak [1996]); then the grammar is improved by selecting rules with high "benefit" scores (à la Brill [1993]). Using this simple algorithm with a naive heuristic for matching rules, we achieve surprising accuracy in an evaluation on the Penn Treebank Wall Street Journal.
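
The two training steps and the matching step described above can be illustrated with a small sketch. This is not the speaker's implementation; several details are assumptions made only for illustration: the corpus representation (a list of part-of-speech tags plus base NP spans), the function names extract_rules, match_rules, and prune_rules, the definition of benefit as correct matches minus incorrect matches on a held-out corpus, and greedy left-to-right longest match as the matching heuristic.

```python
from collections import Counter

# Assumed representation: each sentence is (tags, spans), where tags is a list
# of part-of-speech tags and spans is a list of (start, end) base NP
# boundaries, with end exclusive.

def extract_rules(corpus):
    """Read the grammar off the treebank: every tag sequence that
    labels a base NP becomes a rule."""
    rules = set()
    for tags, spans in corpus:
        for start, end in spans:
            rules.add(tuple(tags[start:end]))
    return rules

def match_rules(tags, rules):
    """Assumed matching heuristic: scan left to right and, at each position,
    take the longest rule that matches. Returns (start, end, rule) triples."""
    max_len = max((len(r) for r in rules), default=0)
    found = []
    i = 0
    while i < len(tags):
        for n in range(min(max_len, len(tags) - i), 0, -1):
            candidate = tuple(tags[i:i + n])
            if candidate in rules:
                found.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1  # no rule starts here; move on one token
    return found

def prune_rules(rules, heldout, threshold=1):
    """Keep rules whose benefit reaches the threshold. Benefit is assumed
    here to be correct matches minus incorrect matches on held-out data,
    so rules that never fire (benefit 0) are dropped at threshold 1."""
    benefit = Counter()
    for tags, spans in heldout:
        gold = set(spans)
        for start, end, rule in match_rules(tags, rules):
            benefit[rule] += 1 if (start, end) in gold else -1
    return {r for r in rules if benefit[r] >= threshold}

# Toy data, invented for this example.
train = [
    (["DT", "NN", "VBD", "DT", "JJ", "NN"], [(0, 2), (3, 6)]),
    (["VBG", "NN", "VBZ", "NN"], [(0, 2), (3, 4)]),
]
heldout = [
    (["NN", "VBZ", "VBG", "NN"], [(0, 1), (3, 4)]),
    (["DT", "JJ", "NN", "VBD", "DT", "NN"], [(0, 3), (4, 6)]),
]

rules = extract_rules(train)
pruned = prune_rules(rules, heldout)
found_spans = [(s, e) for s, e, _ in match_rules(["DT", "NN", "VBZ", "VBG", "NN"], pruned)]
```

In this toy run the rule VBG NN is read from the training data but matches incorrectly on the held-out sentence, so its benefit is negative and it is pruned; the remaining rules then bracket the final sentence as two base NPs, DT NN and NN.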