Bill Byrne
November 24th
4:30PM
CSEB Room B17
"Hierarchical Phrase-based Translation with Weighted Finite State Transducers "
Workshops
Parsing the Web: Large-Scale Syntactic Processing
Scalable syntactic processing will underpin the sophisticated language technology needed for next-generation information access. Companies are already using NLP tools to create web-scale question answering and semantic search engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing because of its scale and its variety of styles, genres, and domains. Our proposal is to scale and adapt an existing wide-coverage parser to the web; evaluate and run this parser on Wikipedia, a large and semi-structured text collection; use the parsed wiki data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for a variety of NLP semantic tasks, including a novel combination of distributional and compositional semantics to improve performance on tasks which require fine-grained syntax/semantics integration.
The focus of the proposal will be the C&C parser [1], a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG), a formalism which originated in the syntactic theory literature. A strength of the parser is that it is theoretically well-motivated at all levels, from the grammar formalism, which enables the parser to produce linguistically sophisticated output representing the underlying meaning of a sentence, to the machine learning techniques which underpin its robustness and accuracy. The parser has been evaluated on a number of standard test sets, achieving state-of-the-art accuracy. It has also recently been adapted successfully to the biomedical domain. The parser is surprisingly efficient, given its detailed output, processing tens of sentences per second. For web-scale text processing, we aim to make the parser an order of magnitude faster still. The C&C parser is one of very few parsers currently available that have the potential to produce detailed, accurate analyses at the scale we are considering.
Team Members
Senior Members
- Stephen Clark, University of Cambridge
- Ann Copestake, University of Cambridge
- James R. Curran, University of Sydney, Australia
Graduate Students
- Yue Zhang, Oxford University
- Aurelie Herbelot, Cambridge University
- James Haggerty, University of Sydney
- Byung-Gyu Ahn, Johns Hopkins University
Undergraduate Students
- Curt Van Wyk, Northwestern University
- Jessika Roesner, University of Texas at Austin
- Jonathan Kummerfeld, University of Sydney
- Tim Dawborn, University of Sydney
Weekly Update Slides: Week 1, Week 2, Week 3
Final Presentation


