A Simple, Corpus-Based Method for Finding Base Noun Phrases
Claire Cardie, Cornell University
March 31, 1998
Finding simple, non-recursive, base noun phrases is an important subtask for many natural language processing applications. While previous empirical methods for base NP identification have been rather complex, this talk instead proposes a very simple algorithm that is tailored to the relative simplicity of the task. In particular, the talk will present a corpus-based approach for finding base NPs by matching part-of-speech tag sequences. The training phase of the algorithm is based on two successful techniques: first, the base NP grammar is read from a "treebank" corpus (a la Charniak); then the grammar is improved by selecting rules with high "benefit" scores (a la Brill). Using this simple algorithm with a naive heuristic for matching rules, we achieve surprising accuracy in an evaluation on the Penn Treebank Wall Street Journal.
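
The two training steps and the matching step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation presented in the talk: the abstract does not define the "benefit" score or the matching heuristic, so the sketch assumes benefit is the number of correct bracketings a rule produces minus the number of incorrect ones, and assumes a left-to-right, longest-match heuristic. The function names, the pruning threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter

def extract_rules(corpus):
    """Read the grammar: every POS-tag sequence that spans a gold base NP."""
    return {tuple(tags[s:e]) for tags, spans in corpus for s, e in spans}

def match(tags, rules):
    """Assumed heuristic: scan left to right, taking the longest matching rule."""
    max_len = max((len(r) for r in rules), default=0)
    found, i = [], 0
    while i < len(tags):
        for n in range(min(max_len, len(tags) - i), 0, -1):
            if tuple(tags[i:i + n]) in rules:
                found.append((i, i + n))
                i += n
                break
        else:
            i += 1  # no rule starts here; move on one token
    return found

def score_rules(corpus, rules):
    """Assumed benefit: +1 per correct bracket, -1 per incorrect bracket."""
    benefit = Counter()
    for tags, gold in corpus:
        for s, e in match(tags, rules):
            benefit[tuple(tags[s:e])] += 1 if (s, e) in gold else -1
    return benefit

def prune(corpus, rules, threshold=1):
    """Keep only rules whose benefit reaches the (hypothetical) threshold."""
    benefit = score_rules(corpus, rules)
    return {r for r in rules if benefit[r] >= threshold}

# Toy corpus: (POS tags, gold base NP spans as half-open (start, end) pairs).
train = [
    (["DT", "NN", "VBZ", "VBG"], {(0, 2), (3, 4)}),
    (["DT", "NN", "VBZ", "VBG", "DT", "NN"], {(0, 2), (4, 6)}),
    (["DT", "JJ", "NN", "VBD", "VBG"], {(0, 3)}),
]

grammar = extract_rules(train)
pruned = prune(train, grammar)
spans = match(["DT", "JJ", "NN", "VBZ", "VBG", "DT", "NN"], pruned)
print(sorted(pruned))
print(spans)
```

In this toy corpus the one-tag rule for VBG is read from a single gold base NP but brackets two verbal uses incorrectly, so its assumed benefit is negative and pruning removes it; the remaining rules then bracket only the two determiner-headed sequences in the test sentence. For simplicity the sketch scores rules on the same toy data the grammar was read from.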