Recent Topics in Speech Recognition Research at NTT Laboratories – Sadaoki Furui (Furui Research Laboratory, NTT Human Interface Laboratories)
This talk introduces two recent topics in speech recognition research at Furui Research Laboratory, NTT Human Interface Laboratories.
The first topic is large-vocabulary continuous speech recognition (LVCSR) using a Japanese business newspaper task. No research on Japanese LVCSR has previously been reported. This is mainly because Japanese sentences are written without spaces between words, which makes it difficult to estimate a word N-gram language model. We designed and recorded a Japanese read-speech corpus with text obtained from the Nikkei newspaper. To enable word N-grams to be used, sentences were first segmented into words (morphemes) using a morphological analyzer. A word-frequency list (WFL) was formed from 6.8M sentences from about 5 years of newspaper articles, and this yielded a list of 600K words. We selected three sets of texts to be recorded according to vocabulary size: the top 7K, 30K, and 150K words of the WFL. We recorded 5,400 utterances by 54 speakers. We also developed an LVCSR system using context-dependent phoneme HMMs and bigram and trigram grammars. For the 7K vocabulary, the word error rates were 16.2% and 11.2% for the bigram and trigram grammars, respectively. These results show that N-gram language models and context-dependent acoustic models are also very effective in reducing errors for Japanese LVCSR.
The second topic is a new paradigm for unsupervised instantaneous speaker adaptation, which uses the input utterance itself for adaptation. Since voice individuality is phoneme-dependent, speaker adaptation must be performed model-dependently. However, it is impossible to obtain the complete model sequence, that is, what was actually spoken, for each input utterance, especially for speakers who produce many recognition errors with speaker-independent models. Therefore, how to perform model-dependent adaptation without knowing the correct model sequence is a crucial issue.
If all possible model sequences were hypothesized and used for adaptation, the computational cost would be enormous. We have recently proposed a new adaptation method in which N-best hypotheses are created by applying speaker-independent phone models to each input utterance, and speaker adaptation based on a constrained MAP estimation technique is then applied to each hypothesis. With this method, the likelihood of a correct hypothesis that ranks low under the speaker-independent models rises, and recognition accuracy increases as a result. Experimental results for several continuous speech recognition tasks show that recognition accuracy is increased by this method, even for speakers who have very low accuracy with speaker-independent models.
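The flow of the N-best adaptation idea described above can be illustrated with a toy sketch. It is not the system presented in the talk: it assumes one-dimensional features, unit-variance Gaussian phone models, a fixed frame-to-phone alignment for each hypothesis, and a plain MAP mean update in place of the constrained MAP estimation, whose details are not given here. The names `SI_MEANS`, `map_adapt`, `rescore_nbest`, and the prior weight `tau` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical speaker-independent (SI) phone models:
# 1-D Gaussians with unit variance, one mean per phone.
SI_MEANS = {"a": 0.0, "i": 2.0, "u": 4.0}


def log_likelihood(frames, labels, means):
    """Sum of unit-variance Gaussian log densities of the frames,
    each scored under the phone it is aligned to."""
    return sum(
        -0.5 * math.log(2 * math.pi) - 0.5 * (x - means[p]) ** 2
        for x, p in zip(frames, labels)
    )


def map_adapt(frames, labels, si_means, tau=2.0):
    """Plain MAP mean update per phone (an assumption standing in for
    the constrained MAP estimation mentioned in the text):
    adapted = (tau * prior_mean + sum_of_frames) / (tau + frame_count)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, p in zip(frames, labels):
        sums[p] += x
        counts[p] += 1
    adapted = dict(si_means)  # phones not seen in this hypothesis keep SI means
    for p, n in counts.items():
        adapted[p] = (tau * si_means[p] + sums[p]) / (tau + n)
    return adapted


def rescore_nbest(frames, nbest, si_means, tau=2.0):
    """Adapt the models separately for each hypothesis, rescore that
    hypothesis with its own adapted models, and sort best-first."""
    results = []
    for labels in nbest:
        adapted = map_adapt(frames, labels, si_means, tau)
        results.append((log_likelihood(frames, labels, adapted), labels))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


# Toy utterance from a speaker whose features are shifted by about +1
# relative to the SI means, with three hypothesized frame labelings.
frames = [0.9, 1.1, 2.8, 3.1, 5.0, 5.2]
nbest = [
    ["a", "a", "i", "i", "u", "u"],
    ["a", "a", "i", "u", "u", "u"],
    ["a", "i", "i", "i", "u", "u"],
]

for score, labels in rescore_nbest(frames, nbest, SI_MEANS):
    si_score = log_likelihood(frames, labels, SI_MEANS)
    print(" ".join(labels), f"SI={si_score:.3f}", f"adapted={score:.3f}")
```

Only the N hypotheses are adapted and rescored, so the cost grows linearly with N rather than with the number of all possible model sequences. In this toy setting, the MAP update moves each mean toward the frames assigned to it, so every hypothesis scores at least as well after adaptation as before; which hypothesis ends up ranked first depends on the data.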