Weakly Supervised Learning for Wide Coverage Parsing

Research Group of the 2002 Summer Workshop

Before a computer can try to understand or translate a human sentence, it must identify the phrases and diagram the grammatical relationships among them. This is called parsing.

State-of-the-art parsers correctly guess over 90% of the phrases and relationships, but make some errors on nearly half the sentences analyzed. Many of these errors distort any subsequent automatic interpretation of the sentence.

Much of the problem is that these parsers, which are statistical, are not “trained” on enough example parses to know about many of the millions of potentially related word pairs. Human labor can produce more examples, but still too few by orders of magnitude.

In this project, we seek a major advance by automatically generating large volumes of novel training examples. We plan to bootstrap from up to 350 million words of raw newswire stories, using existing parsers to generate the new parses together with confidence measures.

We will use a method called co-training, in which several reasonably good parsing algorithms collaborate to automatically identify one another’s weaknesses (errors) and to correct them by supplying new example parses to one another. This accuracy-boosting technique has widespread application in other areas of machine learning, natural language processing and artificial intelligence.
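
The co-training loop described above can be illustrated with a toy sketch. Everything in it is a simplifying assumption made for illustration, not the project's actual parsers or selection method: `ToyTagger` stands in for a real parser and merely assigns one label per word, the two "views" (`first_letter` and `last_letter`) are deliberately crude, and confidence is simply the fraction of words whose feature the model has seen before.

```python
from collections import Counter, defaultdict


class ToyTagger:
    """Hypothetical stand-in for a parser: labels each word from one feature 'view'."""

    def __init__(self, feature):
        self.feature = feature
        self.counts = defaultdict(Counter)

    def train(self, examples):
        # Retrain from scratch on (words, labels) pairs.
        self.counts = defaultdict(Counter)
        for words, labels in examples:
            for word, label in zip(words, labels):
                self.counts[self.feature(word)][label] += 1

    def parse(self, words):
        # Return (labels, confidence); confidence is the share of known words.
        if not words:
            return (), 0.0
        labels, known = [], 0
        for word in words:
            seen = self.counts.get(self.feature(word))
            if seen:
                labels.append(seen.most_common(1)[0][0])
                known += 1
            else:
                labels.append("?")
        return tuple(labels), known / len(words)


def co_train(a, b, seed, unlabeled, rounds=3, threshold=1.0):
    """Each model labels raw sentences; confident output becomes training data for the other."""
    train_a, train_b = list(seed), list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        a.train(train_a)
        b.train(train_b)
        remaining = []
        for words in pool:
            labels_a, conf_a = a.parse(words)
            labels_b, conf_b = b.parse(words)
            if conf_a >= threshold and conf_a >= conf_b:
                train_b.append((words, labels_a))  # A supplies an example to B
            elif conf_b >= threshold:
                train_a.append((words, labels_b))  # B supplies an example to A
            else:
                remaining.append(words)  # neither is confident yet; retry next round
        pool = remaining
    a.train(train_a)
    b.train(train_b)
    return a, b


def first_letter(word):
    return word[0]


def last_letter(word):
    return word[-1]


seed = [(("dogs", "bark"), ("N", "V"))]
unlabeled = [("cats", "bark"), ("cows", "bellow")]
a, b = co_train(ToyTagger(first_letter), ToyTagger(last_letter), seed, unlabeled, rounds=2)
print(b.parse(("hens", "crow")))  # (('N', 'V'), 1.0)
```

In this toy run, the second model first teaches the first one how to label "cats", which then lets the first model label "cows bellow" with full confidence and hand that example back. Trained on the seed alone, the second model could not label "crow". A real system would replace the taggers with full parsers and the coverage score with the parsers' own confidence measures.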

Numerous challenges must be faced. How do we parse 350 million words of text in less than a year, when we have only six weeks? How do we use partly incompatible parsers to train one another? Which machine learning techniques scale up best? What kinds of grammars, probability models, and confidence measures work best? The project will involve a significant amount of programming, but the rewards should be high.

Team Members
Senior Members
Rebecca Hwa	University of Maryland
Miles Osborne	University of Edinburgh
Anoop Sarkar	University of Pennsylvania
Mark Steedman	University of Edinburgh
Graduate Students
Stephen Clark	University of Edinburgh
Julia Hockenmaier	University of Edinburgh
Paul Ruhlen	JHU
Undergraduate Students
Steven Baker	Cornell University
Jeremiah Crim	JHU

Center for Language and Speech Processing