Before a computer can try to understand or translate a human sentence, it must identify the phrases and diagram the grammatical relationships among them. This is called parsing.
State-of-the-art parsers correctly identify over 90% of the phrases and relationships, but still make errors on nearly half of the sentences they analyze. Many of these errors distort any subsequent automatic interpretation of the sentence.
Much of the problem is that these parsers, which are statistical, are not “trained” on enough example parses to know about many of the millions of potentially related word pairs. Human annotators can produce more examples, but orders of magnitude too few.
In this project, we seek a major advance by automatically generating large volumes of novel training examples. We plan to bootstrap from up to 350 million words of raw newswire stories, using existing parsers to generate the new parses together with confidence measures.
We will use a method called co-training, in which several reasonably good parsing algorithms collaborate to automatically identify one another’s weaknesses (errors) and to correct them by supplying new example parses to one another. This accuracy-boosting technique has widespread application in other areas of machine learning, natural language processing and artificial intelligence.
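The co-training loop described above can be sketched in miniature. In this illustration, a toy unigram tagger stands in for a full statistical parser, and the confidence measure is simply the fraction of words the tagger recognizes; the class names, threshold, and scoring are hypothetical placeholders, not the project's actual parsers or confidence measures.

```python
class ToyTagger:
    """Stand-in for a statistical parser: a unigram POS lexicon.
    Real co-training would use full parsers; this is only a sketch."""

    def __init__(self, lexicon):
        self.lexicon = dict(lexicon)

    def parse(self, sentence):
        """Tag each word; confidence = fraction of words in the lexicon."""
        tags = [self.lexicon.get(w) for w in sentence]
        confidence = sum(t is not None for t in tags) / len(sentence)
        return list(zip(sentence, tags)), confidence

    def train(self, examples):
        """Absorb new (tagged sentence, confidence) training examples."""
        for tagged, _conf in examples:
            for word, tag in tagged:
                if tag is not None:
                    self.lexicon.setdefault(word, tag)


def cotrain(a, b, unlabeled, threshold=1.0, rounds=2):
    """Each parser labels the raw text; its most confident analyses
    become new training examples for the *other* parser."""
    for _ in range(rounds):
        for teacher, student in ((a, b), (b, a)):
            confident = [(parsed, conf)
                         for sent in unlabeled
                         for parsed, conf in [teacher.parse(sent)]
                         if conf >= threshold]
            student.train(confident)
    return a, b


# Each tagger starts with a gap the other can fill.
a = ToyTagger({"the": "DET", "cat": "N"})
b = ToyTagger({"the": "DET", "sleeps": "V"})
cotrain(a, b, [["the", "cat"], ["sleeps"]])
```

After two rounds, each tagger has learned the word it was missing from the other's confident output: `a` acquires "sleeps" and `b` acquires "cat". The real project replaces the toy lexicon with full grammars and probability models, and the threshold with principled confidence measures.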
Numerous challenges must be faced: how do we parse 350 million words of text in less than a year (we have six weeks)? How can partly incompatible parsers train one another? Which machine learning techniques scale up best? What kinds of grammars, probability models, and confidence measures work best? The project will involve a significant amount of programming, but the rewards should be high.
- Rebecca Hwa, University of Maryland
- Miles Osborne, University of Edinburgh
- Anoop Sarkar, University of Pennsylvania
- Mark Steedman, University of Edinburgh
- Stephen Clark, University of Edinburgh
- Julia Hockenmaier, University of Edinburgh
- Steven Baker, Cornell University