Before a computer can try to understand or translate a human sentence,
it must identify the phrases and diagram the grammatical relationships
among them. This is called parsing.
State-of-the-art parsers correctly guess over 90% of the phrases and
relationships, but make at least one error on nearly half the
sentences analyzed. Many of these errors distort any subsequent
automatic interpretation of the sentence.
Much of the problem is that these parsers, which are statistical, are
not "trained" on enough example parses to know about many of the
millions of potentially related word pairs. Human labor can produce
more examples, but still too few by orders of magnitude.
In this project, we seek to achieve a major advance by automatically
generating large volumes of novel training examples. We plan to
bootstrap from up to 350 million words of raw newswire stories, using
existing parsers to generate the new parses together with confidence
measures.
We will use a method called co-training, in which several reasonably
good parsing algorithms collaborate to automatically identify one
another's weaknesses (errors) and to correct them by supplying new
example parses to one another. This accuracy-boosting technique has
widespread application in other areas of machine learning, natural
language processing and artificial intelligence.
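
The co-training loop described above can be sketched in a few lines.
This is only a toy illustration under stated assumptions, not the
project's actual system: each "parser" is replaced by a hypothetical
ViewLearner that guesses a phrase label from a single feature of an
item, and its confidence is simply the relative frequency of that
label in its training data. All names, the data, and the 0.9
threshold are invented for the example.

```python
from collections import Counter, defaultdict


class ViewLearner:
    """Toy stand-in for a parser: predicts a label from one feature (view)."""

    def __init__(self, view):
        self.view = view                      # index of the feature this learner sees
        self.counts = defaultdict(Counter)    # feature value -> label counts

    def train(self, examples):
        """Retrain from scratch on (item, label) pairs."""
        self.counts = defaultdict(Counter)
        for item, label in examples:
            self.counts[item[self.view]][label] += 1

    def predict(self, item):
        """Return (label, confidence); (None, 0.0) if the feature is unseen."""
        seen = self.counts.get(item[self.view])
        if not seen:
            return None, 0.0
        label, n = seen.most_common(1)[0]
        return label, n / sum(seen.values())


def co_train(learner_a, learner_b, labeled, unlabeled, rounds=3, threshold=0.9):
    """Each learner hands its confident guesses to the OTHER learner as training data."""
    train_a, train_b = list(labeled), list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        learner_a.train(train_a)
        learner_b.train(train_b)
        remaining = []
        for item in pool:
            label_a, conf_a = learner_a.predict(item)
            label_b, conf_b = learner_b.predict(item)
            used = False
            if conf_a >= threshold:           # A is confident: teach B
                train_b.append((item, label_a))
                used = True
            if conf_b >= threshold:           # B is confident: teach A
                train_a.append((item, label_b))
                used = True
            if not used:
                remaining.append(item)        # nobody is confident yet; try again later
        pool = remaining
    learner_a.train(train_a)
    learner_b.train(train_b)
    return learner_a, learner_b


# Items are (head word, context word); labels are phrase types. Invented data.
labeled = [(("dog", "the"), "NP"), (("run", "to"), "VP")]
unlabeled = [("cat", "the"), ("dog", "a"), ("jump", "to"), ("run", "will")]

by_head, by_context = co_train(ViewLearner(0), ViewLearner(1), labeled, unlabeled)
print(by_head.predict(("cat", "?")))      # head-word learner, taught by the context learner
print(by_context.predict(("?", "will")))  # context learner, taught by the head-word learner
```

The two learners start from the same two hand-labeled examples but
look at different evidence, so each can label items the other cannot.
Real parsers would exchange full parse trees rather than single
labels, and their confidence measures would need to be far more
careful than a relative frequency.
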
Numerous challenges must be faced: How do we parse 350 million words
of text in less than a year (we have 6 weeks)? How do we use partly
incompatible parsers to train one another? Which machine learning
techniques scale up best? What kind of grammars, probability models,
and confidence measures work best? The project will involve a
significant amount of programming, but the rewards should be high.
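
To make the scale of the first challenge concrete, the following
back-of-the-envelope calculation converts the 350 million words and
6 weeks mentioned above into a required parsing rate. The average
sentence length and the per-processor parsing speed are assumptions
chosen purely for illustration, not measurements of any real parser.

```python
import math

TOTAL_WORDS = 350_000_000          # from the text above
WEEKS_AVAILABLE = 6                # from the text above

# Assumptions for illustration only (not from the project description):
ASSUMED_WORDS_PER_SENTENCE = 25
ASSUMED_SENTENCES_PER_SECOND_PER_CPU = 0.5

seconds_available = WEEKS_AVAILABLE * 7 * 24 * 3600
required_words_per_second = TOTAL_WORDS / seconds_available
required_sentences_per_second = required_words_per_second / ASSUMED_WORDS_PER_SENTENCE

# Processors needed if every one of them parses around the clock for all 6 weeks.
cpus_needed = math.ceil(
    required_sentences_per_second / ASSUMED_SENTENCES_PER_SECOND_PER_CPU
)

print(f"required rate: {required_words_per_second:.1f} words/second")
print(f"               {required_sentences_per_second:.2f} sentences/second")
print(f"processors needed under the assumed speed: {cpus_needed}")
```

Because several parsers must each process the text, and co-training
repeats the parsing over multiple rounds, the real requirement would
be a multiple of this single-pass figure.
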