Natural Language Parsing: Graphs, the A* Algorithm, and Modularity – Christopher Manning (Stanford University)
Probabilistic parsing methods have in recent years transformed our ability to robustly find correct parses for open-domain sentences. But parsers are still normally thought of in logical terms, via the notion of "parsing as deduction". I will instead connect stochastic parsing with finding shortest paths in hypergraphs, and show how this approach naturally yields a chart parser for arbitrary probabilistic context-free grammars (finding shortest paths in a hypergraph is easy; the central problem of parsing is that the hypergraph must be constructed on the fly). Using such a parser run exhaustively, I will briefly consider the properties of the Penn Treebank (the most widely used hand-parsed corpus), the vast parsing ambiguity that results from these properties, and how simple models can accurately predict the amount of work a parser does on this corpus.
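The connection between chart parsing and hypergraph shortest paths can be made concrete with a small sketch. Under negative log probabilities, the highest-probability parse of a span is the minimum-cost hyperpath to that chart item, so exhaustive Viterbi CKY is a shortest-path computation over the parse hypergraph. The grammar below is a hypothetical toy PCFG, not one from the talk:

```python
import math
from collections import defaultdict

# Toy PCFG in Chomsky normal form (hypothetical example grammar).
# Costs are -log prob, so the best parse is a min-cost hyperpath.
binary_rules = {           # (B, C) -> list of (A, P(A -> B C))
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
    ("V", "NP"): [("VP", 1.0)],
}
unary_lex = {              # word -> list of (tag, P(tag -> word))
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5), ("NP", 0.2)],
    "cat": [("N", 0.5), ("NP", 0.2)],
    "saw": [("V", 1.0)],
}

def viterbi_cky(words):
    """Exhaustive CKY: fill the chart bottom-up, keeping the min-cost
    (i.e. max-probability) derivation of each labeled span."""
    n = len(words)
    chart = defaultdict(lambda: math.inf)   # (i, j, label) -> cost
    back = {}                               # backpointers for the best tree
    for i, w in enumerate(words):
        for tag, p in unary_lex.get(w, []):
            chart[(i, i + 1, tag)] = -math.log(p)
            back[(i, i + 1, tag)] = w
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # split point
                for (b, c), parents in binary_rules.items():
                    left, right = chart[(i, k, b)], chart[(k, j, c)]
                    if math.isinf(left) or math.isinf(right):
                        continue
                    for a, p in parents:
                        cost = -math.log(p) + left + right
                        if cost < chart[(i, j, a)]:
                            chart[(i, j, a)] = cost
                            back[(i, j, a)] = (k, b, c)
    return chart, back
```

For example, `viterbi_cky("the dog saw the cat".split())` fills the whole chart, and the cost of the item `(0, 5, "S")` is the negative log probability of the best parse. The "constructed on the fly" point shows up here: the chart items and hyperedges are enumerated as the loops run, rather than existing up front.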
From the hypergraph viewpoint, a natural next step is to use the A* algorithm to cut down the work of finding the best parse. On unlexicalized grammars, this reduces the parsing work dramatically, by at least 97%. The approach is competitive with the methods standardly used in statistical parsers while, unlike most heuristic approaches to best-first parsing, still guaranteeing optimality.
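The idea can be sketched as an agenda-based parser that pops chart items in order of inside cost plus a heuristic estimate of the outside cost. This is a simplified illustration of the principle, not the talk's implementation: with the zero heuristic it behaves like uniform-cost (Dijkstra-style) search over the parse hypergraph, and an admissible heuristic, one that never overestimates the remaining cost, prunes work while the first goal item popped is still optimal. The start symbol `"S"` and grammar encoding are assumptions carried over from the toy example:

```python
import heapq
import math

def astar_parse(words, lex, rules, heuristic=lambda i, j: 0.0, start="S"):
    """Agenda-based best-first parsing. Items (i, j, label) are popped in
    order of inside cost + heuristic(i, j). With an admissible heuristic,
    the first time the goal item is popped, its cost is optimal."""
    n = len(words)
    agenda = []            # entries: (priority, inside_cost, i, j, label)
    best = {}              # finished items: (i, j, label) -> inside cost
    for i, w in enumerate(words):
        for tag, p in lex.get(w, []):
            c = -math.log(p)
            heapq.heappush(agenda, (c + heuristic(i, i + 1), c, i, i + 1, tag))
    pops = 0
    while agenda:
        _, cost, i, j, label = heapq.heappop(agenda)
        if (i, j, label) in best:
            continue       # already finished with a cheaper derivation
        best[(i, j, label)] = cost
        pops += 1
        if (i, j, label) == (0, n, start):
            return cost, pops
        # Combine the new item with every adjacent finished item.
        for (k, l, other), c2 in list(best.items()):
            if l == i:     # other is on the left: other + this
                for a, p in rules.get((other, label), []):
                    new = -math.log(p) + c2 + cost
                    heapq.heappush(agenda, (new + heuristic(k, j), new, k, j, a))
            if k == j:     # other is on the right: this + other
                for a, p in rules.get((label, other), []):
                    new = -math.log(p) + cost + c2
                    heapq.heappush(agenda, (new + heuristic(i, l), new, i, l, a))
    return math.inf, pops
```

The returned `pops` count is a rough proxy for work done: a good heuristic makes the parser pop far fewer items than the exhaustive chart contains, while the optimality guarantee is preserved.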
Finally, I will present a novel modular generative model in which semantic (lexical dependency) and syntactic structures are scored separately. This factored model is conceptually simple, linguistically interesting, and offers straightforward opportunities for improving the component models independently. It also performs at a level close to that of similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model permits the continued use of an extremely effective A* algorithm, which makes efficient, exact inference feasible.
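The factoring idea itself is simple to state: if the probability of an analysis is the product of a syntactic score and a semantic score, then in negative log space the two component costs add, and that sum is exactly what a best-first search minimizes. A minimal sketch, with hypothetical rule and dependency tables rather than the talk's actual models:

```python
import math

def factored_cost(tree_rules, dep_arcs, pcfg, dep_model):
    """Factored model: P(tree, deps) = P_syn(tree) * P_sem(deps), so the
    costs (-log probabilities) of the two components simply add. Each
    component model can be trained or improved independently."""
    syn = sum(-math.log(pcfg[r]) for r in tree_rules)          # syntactic part
    sem = sum(-math.log(dep_model[(h, d)]) for h, d in dep_arcs)  # semantic part
    return syn + sem
```

Because the total cost decomposes this way, an admissible heuristic can be built per component (for instance, a cheap bound from one factor while searching over the other), which is what keeps exact A* inference feasible for the factored model.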
This is joint work with Dan Klein.
Christopher Manning is an Assistant Professor of Computer Science and Linguistics at Stanford University. He received his Ph.D. from Stanford University in 1995, and served on the faculty of the Computational Linguistics Program at Carnegie Mellon University (1994-1996) and the University of Sydney Linguistics Department (1996-1999) before returning to Stanford. His research interests include probabilistic models of language, natural language parsing, constraint-based linguistic theories, syntactic typology, information extraction and text mining, and computational lexicography. He is the author of three books, including Foundations of Statistical Natural Language Processing (MIT Press, 1999, with Hinrich Schuetze).