Parsing the Web: Large-Scale Syntactic Processing

Scalable syntactic processing will underpin the sophisticated language technology needed for next-generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing because of its scale and its variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed Wikipedia data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks that require fine-grained syntax/semantics integration.

The focus of the proposal will be the C&C parser [1], a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient, given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of very few parsers currently available that have the potential to produce detailed, accurate analyses at the scale we are considering.

Weekly Update Slides: Week 1, Week 2, Week 3
Final Presentation
Final Report

Team Members
Senior Members
Stephen Clark, University of Cambridge
Ann Copestake, University of Cambridge
James R. Curran, University of Sydney, Australia
Graduate Students
Byung Gyu Ahn, CLSP
James Haggerty, University of Sydney
Aurelie Herbelot, Cambridge University
Yue Zhang, Oxford University
Undergraduate Students
Tim Dawborn, University of Sydney
Jonathan Kummerfeld, University of Sydney
Jessika Roesner, University of Texas at Austin
Curt Van Wyk, Northwestern University

Center for Language and Speech Processing