Scalable syntactic processing will underpin the sophisticated language technology needed for next-generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing because of its scale and its variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed wiki data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks which require fine-grained syntax/semantics integration.
The focus of the proposal will be the C&C parser [1], a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient, given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of very few parsers currently available with the potential to produce detailed, accurate analyses at the scale we are considering.
Weekly Update Slides: Week 1, Week 2, Week 3
Final Presentation
Final Report
Team Members | |
---|---|
Senior Members | |
Stephen Clark | University of Cambridge |
Ann Copestake | University of Cambridge |
James R. Curran | University of Sydney, Australia |
Graduate Students | |
Byung Gyu Ahn | CLSP |
James Haggerty | University of Sydney |
Aurelie Herbelot | University of Cambridge |
Yue Zhang | University of Oxford |
Undergraduate Students | |
Tim Dawborn | University of Sydney |
Jonathan Kummerfeld | University of Sydney |
Jessika Roesner | University of Texas at Austin |
Curt Van Wyk | Northwestern University |