CDG-based Language Models
- Mary Harper
- 07/14/2004
- Abstract:
This talk concerns our research on the development of effective and efficient language models (LMs) for large vocabulary continuous speech recognition (LVCSR) systems. For this research, we selected Constraint Dependency Grammar (CDG) because the formalism can represent properties of a wide variety of languages and because CDG parses can be lexicalized at the word level with a rich set of lexical features for modeling subcategorization and wh-movement without a blow-up of the parameter space. Two types of LMs have been developed: an almost-parsing LM and a full parser-based LM. The quality of both LMs improved significantly thanks to insights obtained from initial CDG grammar induction experiments. The almost-parsing LM uses a data structure derived from CDG parses, called a SuperARV, that tightly integrates knowledge of words, lexical features, and syntactic constraints. The full CDG parser-based LM uses complete parse information obtained by adding modifiee links to the SuperARVs assigned to each word in a sentence. We have evaluated the almost-parsing LM on several LVCSR tasks and found that it reduces recognition error rates significantly, with much lower time complexity than full parser-based approaches. The full CDG parser-based LM, when evaluated on the DARPA Wall Street Journal CSR task, outperforms the almost-parsing LM and obtains performance comparable to or exceeding that of state-of-the-art parser-based LMs.
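As a rough illustration of the relationship between the two LMs described above, the following Python sketch models a SuperARV-like record (word, lexical features, and syntactic role labels) and then attaches modifiee links to obtain a full parse. It is not the authors' implementation: the class names, field names, role and label names, and the helper `add_modifiee_links` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SuperARVSketch:
    """Hypothetical stand-in for a SuperARV: what the almost-parsing LM sees."""
    word: str
    category: str                          # lexical category, e.g. "noun"
    features: Tuple[Tuple[str, str], ...]  # lexical features as (name, value)
    role_labels: Tuple[Tuple[str, str], ...]  # syntactic constraints as (role, label)


@dataclass
class ParsedWord:
    """A SuperARV-like tag plus modifiee links: what the full parser-based LM sees."""
    tag: SuperARVSketch
    # role -> 1-based position of the word it modifies, or None for no modifiee
    modifiees: Dict[str, Optional[int]] = field(default_factory=dict)


def add_modifiee_links(
    tags: List[SuperARVSketch],
    links: List[Dict[str, Optional[int]]],
) -> List[ParsedWord]:
    """Attach modifiee links to each tag, checking basic well-formedness."""
    if len(tags) != len(links):
        raise ValueError("need exactly one link map per word")
    parsed = []
    for position, (tag, link_map) in enumerate(zip(tags, links), start=1):
        known_roles = {role for role, _ in tag.role_labels}
        for role, target in link_map.items():
            if role not in known_roles:
                raise ValueError(f"role {role!r} not present on {tag.word!r}")
            if target is not None and not (1 <= target <= len(tags)):
                raise ValueError(f"modifiee position {target} out of range")
            if target == position:
                raise ValueError("a word cannot modify itself")
        parsed.append(ParsedWord(tag=tag, modifiees=dict(link_map)))
    return parsed


# Toy two-word sentence with made-up features and labels.
tags = [
    SuperARVSketch("dogs", "noun", (("number", "plural"),), (("governor", "subject"),)),
    SuperARVSketch("bark", "verb", (("number", "plural"),), (("governor", "root"),)),
]
links = [{"governor": 2}, {"governor": None}]
parsed = add_modifiee_links(tags, links)
```

In this sketch the tag sequence alone corresponds to the information available to an almost-parsing LM, while the `modifiees` maps supply the extra dependency structure that a full parser-based LM would use.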