Parsing the Web: Large-Scale Syntactic Processing

Scalable syntactic processing will underpin the sophisticated language technology needed for next-generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing because of its scale and its variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed wiki data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks that require fine-grained integration of syntax and semantics.

The focus of the proposal will be the C&C parser [1], a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of only a few parsers currently available with the potential to produce detailed, accurate analyses at the scale we are considering.

Weekly Update Slides: Week 1, Week 2, Week 3
Final Presentation
Final Report

 

Team Members
Senior Members
Stephen Clark, University of Cambridge
Ann Copestake, University of Cambridge
James R. Curran, University of Sydney, Australia
Graduate Students
Byung Gyu Ahn, CLSP
James Haggerty, University of Sydney
Aurelie Herbelot, Cambridge University
Yue Zhang, Oxford University
Undergraduate Students
Tim Dawborn, University of Sydney
Jonathan Kummerfeld, University of Sydney
Jessika Roesner, University of Texas at Austin
Curt Van Wyk, Northwestern University

Center for Language and Speech Processing