Language Modeling Issues for Spanish Language Large Vocabulary Continuous Speech Recognition
Project Charter:
The goal of this workshop project is to explore language
modeling techniques to improve recognition of unrestricted,
conversational Spanish over telephone channels. The basic
training and test data will come from the Spanish language
component of the Linguistic Data Consortium's Call Home
corpus, a corpus of transcribed telephone conversations.
Text corpora, as well as other sources of transcribed
speech, will also be available.
We will be starting with a baseline Spanish speech
recognizer built with BBN's Byblos speech recognition
system. The workshop will be provided with N-best and/or
lattice outputs from this recognizer. We will endeavor to
develop and evaluate language models for improving on the
baseline performance level. In particular, it will be
desirable to exploit specific aspects of the Spanish
language to improve the performance of the recognizer. The
N-best lists and lattices will provide one means of
evaluating our ideas, and perplexity measurements another.
Our progress will also be measured by our improved
understanding of how language characteristics should
influence our choice of a language model for recognition.
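As a rough illustration of the two evaluation routes mentioned above, the sketch below computes perplexity and rescores an N-best list with a toy language model. Everything in it is an assumption made for illustration: the add-one smoothed bigram model, the function names, the `lm_weight` parameter, the toy sentences, and the acoustic scores are hypothetical and are not taken from the workshop or from the Byblos system.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(words, unigrams, bigrams):
    """Natural-log probability of one sentence under add-one smoothing."""
    # Predictable events: each seen word, </s>, and one slot for unseen words.
    # <s> is never predicted, so len(unigrams) gives exactly that count.
    vocab_size = len(unigrams)
    tokens = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )

def perplexity(sentences, unigrams, bigrams):
    """Per-event perplexity, counting each word and each sentence end."""
    total = sum(log_prob(s, unigrams, bigrams) for s in sentences)
    events = sum(len(s) + 1 for s in sentences)
    return math.exp(-total / events)

def rescore_nbest(nbest, unigrams, bigrams, lm_weight=1.0):
    """Return the (words, acoustic_log_score) pair with the best combined score."""
    return max(
        nbest,
        key=lambda hyp: hyp[1] + lm_weight * log_prob(hyp[0], unigrams, bigrams),
    )

# Toy data, invented for this example only.
train = [["hola", "como", "estas"], ["hola", "que", "tal"], ["como", "estas", "tu"]]
unigrams, bigrams = train_bigram(train)

nbest = [
    (["hola", "como", "estas"], -10.0),  # hypothetical acoustic log score
    (["ola", "como", "estas"], -9.9),
]
best_words, _ = rescore_nbest(nbest, unigrams, bigrams)
print(" ".join(best_words))
print(perplexity([["hola", "como", "estas"]], unigrams, bigrams))
```

In this sketch the second hypothesis has the better acoustic score, but the language model score outweighs that small difference once the two are combined. Lower perplexity on held-out text and a better choice among N-best hypotheses are the two kinds of evidence the charter refers to.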
Project Team:
Herb Gish (Project Leader)
German Bordel
Lin Chase
Pierre Dupont
Carol Van Ess-Dykema
Jose Oncina
Eric Wheeler (student assistant)
Resources:
Presently, the following are available:
- The LDC Spanish Call Home corpus:
  - Transcriptions: /export/a/clsp/LM95/data/callhome-spanish
  - Lattices: /export/a/clsp/LM95/lats/lm95_bbn_Spanish
  - Recordings of conversations: (To be installed soon)
- The ECI Spanish corpus:
  - Transcriptions: /export/a/clsp/LM95/data/eci-spanish
- Various news wire feed corpora:
  - Transcriptions provided by BBN: /export/a/clsp/LM95/data/newswire1
  - Transcriptions provided by DoD: (To be installed soon)
- Spanish Dictionaries provided by BBN:
  - Phonetic dictionary: /export/a/clsp/LM95/etc/sp80-8.6k-open.dict
  - Morphological dictionary: /export/a/clsp/LM95/etc/Spanish04
Have you seen:
- The CLSP Homepage.
- The CLSP '95 LM Workshop Homepage.