Scalable syntactic processing will underpin the sophisticated language technology needed for next-generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing because of its scale and its variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed wiki data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks which require fine-grained syntax/semantics integration.
The focus of the proposal will be the C&C parser [1], a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient, given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of very few parsers currently available with the potential to produce detailed, accurate analyses at the scale we are considering.
Weekly Update Slides: Week 1, Week 2, Week 3
Final Presentation
Final Report
| Team Members | |
|---|---|
| Senior Members | |
| Stephen Clark | University of Cambridge |
| Ann Copestake | University of Cambridge |
| James R. Curran | University of Sydney, Australia |
| Graduate Students | |
| Byung Gyu Ahn | CLSP |
| James Haggerty | University of Sydney |
| Aurelie Herbelot | University of Cambridge |
| Yue Zhang | Oxford University |
| Undergraduate Students | |
| Tim Dawborn | University of Sydney |
| Jonathan Kummerfeld | University of Sydney |
| Jessika Roesner | University of Texas at Austin |
| Curt Van Wyk | Northwestern University |