Scalable syntactic processing will underpin the sophisticated language technology needed for next-generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing because of its scale and its variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed Wikipedia data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks which require fine-grained integration of syntax and semantics.
The focus of the proposal will be the C&C parser, a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient, given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of only very few parsers currently available which have the potential to produce detailed, accurate analyses at the scale we are considering.
|Stephen Clark||University of Cambridge|
|Ann Copestake||University of Cambridge|
|James R. Curran||University of Sydney, Australia|
|Byung Gyu Ahn||CLSP|
|James Haggerty||University of Sydney|
|Aurelie Herbelot||Cambridge University|
|Yue Zhang||Oxford University|
|Tim Dawborn||University of Sydney|
|Jonathan Kummerfeld||University of Sydney|
|Jessika Roesner||University of Texas at Austin|
|Curt Van Wyk||Northwestern University|