Low Development Cost, High Quality Speech Recognition for New Languages and Domains

The cost of developing speech-to-text systems for new languages and domains is dominated by the need to transcribe a large quantity of data. We aim to reduce this cost significantly.

In the speaker identification community, limitations on the amount of enrollment data per speaker are dealt with by adapting a “Universal Background Model” (UBM) to the observations from a given speaker. Subspace-based techniques can be used in this process to reduce the number of speaker-specific parameters that must be trained from the enrollment data. One approach that has been successful in achieving this goal is factor analysis. We have recently performed speech recognition experiments showing that this factor-analysis-based approach can beat state-of-the-art techniques. The improvements are particularly large when the amount of training data is small: for example, a 20% relative improvement on a maximum-likelihood-trained, fully adapted Broadcast News system with 50 hours of training data. Because the UBM-based system has fewer parameters, it needs less training data. Another advantage of the UBM framework is that it allows natural parameter tying across domains and languages, which should further reduce the amount of training data needed when migrating to a new language. We anticipate particularly large reductions in word error rate (WER) when training on small amounts of language-specific data, e.g. a few hours.
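To make the UBM adaptation step concrete, the sketch below shows the standard relevance-MAP adaptation of Gaussian mixture means used in speaker-identification UBM systems: posteriors are accumulated over the (small) amount of target data, and each component mean is interpolated between its data estimate and the UBM prior. This is a minimal illustrative numpy implementation of the generic technique, not code from our system; the function name and the relevance factor of 16 are assumptions for the example.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_covars, ubm_weights, data, relevance=16.0):
    """MAP-adapt the means of a diagonal-covariance GMM (the UBM)
    toward a small amount of target data (illustrative sketch).

    ubm_means:   (K, D) component means
    ubm_covars:  (K, D) diagonal covariances
    ubm_weights: (K,)   mixture weights
    data:        (N, D) observations from the target speaker/domain
    relevance:   MAP relevance factor (prior strength; value is an assumption)
    """
    # Per-frame, per-component log-likelihoods of diagonal Gaussians.
    diff = data[:, None, :] - ubm_means[None, :, :]                 # (N, K, D)
    log_prob = -0.5 * np.sum(diff**2 / ubm_covars
                             + np.log(2 * np.pi * ubm_covars), axis=2)
    log_prob += np.log(ubm_weights)
    # Posterior responsibilities via log-sum-exp for numerical stability.
    log_norm = np.logaddexp.reduce(log_prob, axis=1, keepdims=True)
    gamma = np.exp(log_prob - log_norm)                             # (N, K)
    # Zeroth- and first-order sufficient statistics per component.
    n_k = gamma.sum(axis=0)                                         # (K,)
    f_k = gamma.T @ data                                            # (K, D)
    # Interpolate the data-driven mean with the UBM prior mean:
    # components with little data stay close to the UBM.
    alpha = n_k / (n_k + relevance)                                 # (K,)
    data_means = f_k / np.maximum(n_k, 1e-10)[:, None]
    return alpha[:, None] * data_means + (1 - alpha)[:, None] * ubm_means
```

With very few frames assigned to a component, `alpha` is near zero and the adapted mean stays at the UBM prior; with many frames it approaches the data mean. A subspace/factor-analysis approach reduces the adapted parameters further by constraining the mean offsets to a low-dimensional subspace (roughly, mean = prior + V·y with a small vector y) rather than adapting every mean freely.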

The UBM-based framework for speech recognition is scientifically interesting because it unifies speech recognition and speaker identification techniques. Speaker identification techniques, originally based on those used for speech recognition, have been extended in recent years by the Universal Background Model and the factor analysis approach. Our approach brings those ideas back into speech recognition, and we anticipate that the techniques developed may in turn improve speaker verification performance (although that is not the focus of this workshop). The purpose of the workshop would be to bring top speech recognition and speaker identification researchers together to work on this technique, which straddles the two fields. We would apply it to speech recognition for under-resourced languages, but the techniques developed would have much wider applicability.

Since a workable approach to applying UBMs to speech recognition has already been devised, the pre-workshop phase can focus on preparing data, building baseline systems, and implementing the existing UBM-based approach within an open-source framework based on the HTK toolkit for eventual release. During the workshop we can focus on optimizing and extending the techniques used in UBM-based modeling, studying cross-language effects, developing tools to reduce the labor of building a pronunciation dictionary, and packaging our setup for use by others.

The approach we intend to pursue is of great scientific interest for both speech recognition and speaker identification, as it concerns the core modeling approach used in both communities. We will make the tools we develop available and easy to use even for non-experts, so our work should have direct benefits for those who need to build effective speech recognition systems, as well as research and educational value. Given our positive initial results, this work will be valuable regardless of the outcome of our experiments during the workshop.



Team Members
Senior Members
Lukas Burget, Brno University of Technology
Nagendra Kumar Goel, Apptek Inc.
Dan Povey, Microsoft
Richard Rose, McGill University
Graduate Students
Samuel Thomas, CLSP
Arnab Ghoshal, Johns Hopkins University
Petr Schwarz, Brno University of Technology
Undergraduate Students
Mohit Agarwal, IIIT Allahabad
Pinar Akyazi, Boğaziçi University

Center for Language and Speech Processing