|
Johns Hopkins University 3400 N. Charles Street
|
|
| My Thesis Work:
My thesis advisor is Gerard Meyer. I'm working on the general problem of how to train speech recognizers more cost-effectively. Speech audio is abundant, but it usually comes without the corresponding text transcription that we currently rely on for training. The cost of collecting the audio is minimal, but the cost of transcription is huge. Some good results have been presented (BBN, LIMSI) that use the recognizer itself to provide a transcription that is then used in subsequent training, relying on a confidence measure or a correlation with existing closed-captioned text to select regions that are appropriate to feed back into training.

I initially tried to mimic these automatic transcription approaches to preparing training data using HTK and a simple acoustic corpus, the OGI-alphadigit corpus, which consists of the letters A-Z and the digits 0-9 spoken in English by ~3000 speakers. What I found contradicts the successes reported in the literature: no matter how carefully I chose which automatic transcriptions to use, as long as there were errors in the transcription, my models degraded, and rather quickly at that.

But this led me to an interesting discovery. Starting with a seed model trained on a small portion of the available training data, I run recognition on the remaining potential training data. I then count the errors the recognizer made in each sentence (6 alphadigits, in my case study) and select the sentences with the most errors. I feed these most errorful sentences back into training, using the human transcription. Iterating this procedure, I ultimately select 35% of the training data and build a model with an error rate of 9.4%. The baseline error rate using all the training data is 10.3%. These results are reported in [Kamm 2001]. After this discovery, I did a literature search and found that my training algorithm fits into the class of "Active Learning using Selective Sampling" algorithms.
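As a rough illustration of the iterative selection described above, here is a minimal Python sketch. It is not the HTK-based implementation used in the thesis work: the function name `select_by_errors` and the callables `train` and `count_errors` are hypothetical placeholders standing in for model training and for scoring a recognized sentence against its human transcription, and the batch size and budget are assumed parameters.

```python
def select_by_errors(seed_ids, pool_ids, train, count_errors, batch_size, budget):
    """Grow a training set by repeatedly adding the sentences the current
    model recognizes worst (hypothetical sketch, not the thesis code).

    train(ids) -> model               : placeholder for acoustic model training
    count_errors(model, sent_id) -> n : placeholder for errors vs. the human transcript
    budget                            : total number of sentences to end up with
    """
    seed = set(seed_ids)
    selected = list(seed_ids)
    remaining = [i for i in pool_ids if i not in seed]
    model = train(selected)  # seed model

    while remaining and len(selected) < budget:
        # Rank unused sentences by how many errors the current model makes.
        ranked = sorted(remaining, key=lambda i: count_errors(model, i), reverse=True)
        take = ranked[: min(batch_size, budget - len(selected))]

        # Feed the most errorful sentences back into training.
        selected.extend(take)
        taken = set(take)
        remaining = [i for i in remaining if i not in taken]
        model = train(selected)

    return selected, model
```

In the case study the loop would stop once about 35% of the training data had been selected; here that stopping point is expressed through the assumed `budget` argument.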
In [Cohn], the selective sampling theory is developed in the context of learning a binary concept in the absence of noise, and it is shown how it may be approximately implemented in a neural network. It is also suggested that selective sampling "is well suited to problems such as speech recognition", but no indication is given of how to apply such an approach to the complex problem of speech recognition. As far as I can tell, my simple case study is the first successful application of selective sampling in speech recognition training.

I then went on to extend the algorithm to use untranscribed speech, with the goal of selecting the portion that, once a human is paid to transcribe it, will contribute most to reducing the error rate. (Using my case study data, I pretend that I have no human transcription other than the seed model training data.) To do this I apply the same iterative algorithm, but instead of counting errors I use a confidence measure to pick sentences. By selecting sentences to which the recognizer assigns low confidence, and then having a human transcribe those sentences, I can achieve the baseline error rate of 10.3% after transcribing only 25% of the corpus. Further, I can reduce the error rate to 10.1% by transcribing 65% of the data. These results are reported in [Kamm 2002].

[Cohn] David Cohn, Les Atlas and Richard Ladner. (1994) Improving generalization with active learning, Machine Learning 15(2):201-221.
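The confidence-driven variant for untranscribed speech might be sketched in Python as follows. This is only an illustration under stated assumptions: `confidence`, `transcribe`, and `train` are hypothetical callables standing in for the recognizer's confidence measure, the paid human transcriber, and model training, and `max_fraction` is an assumed stopping parameter (for example 0.25 for the 25% operating point).

```python
import heapq


def pick_for_transcription(confidence_by_id, k):
    """Return the k sentence ids with the lowest recognizer confidence."""
    return heapq.nsmallest(k, confidence_by_id, key=confidence_by_id.get)


def active_transcription_loop(seed, pool_ids, train, confidence, transcribe,
                              batch_size, max_fraction):
    """Hypothetical sketch: pay for human transcripts only where the
    current model is least confident.

    seed       : dict of sentence id -> human transcript (seed training data)
    train      : placeholder, labeled dict -> model
    confidence : placeholder, (model, sentence id) -> confidence score
    transcribe : placeholder for the human transcriber, sentence id -> text
    """
    labeled = dict(seed)
    unlabeled = [i for i in pool_ids if i not in labeled]
    total = len(labeled) + len(unlabeled)
    model = train(labeled)

    # Note: the last batch may overshoot max_fraction slightly.
    while unlabeled and len(labeled) / total < max_fraction:
        scores = {i: confidence(model, i) for i in unlabeled}
        batch = pick_for_transcription(scores, batch_size)
        for i in batch:
            labeled[i] = transcribe(i)  # the only step that costs money
        chosen = set(batch)
        unlabeled = [i for i in unlabeled if i not in chosen]
        model = train(labeled)

    return labeled, model
```

The only structural change from error-driven selection is the ranking signal: no human transcript is needed to score a sentence, so transcription cost is incurred only for the sentences actually chosen.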
|
| Publications: from newest to oldest.
[Kamm 2004a] T. M. Kamm and G. G. L. Meyer, Robustness Aspects of Active Learning for Acoustic Modeling, submitted to ICSLP 2004, Jeju Island, Korea, October 2004.

[Kamm 2004] T. M. Kamm, Active Learning for Acoustic Speech Recognition Modeling, Ph.D. Dissertation, Johns Hopkins University, 2004.

[Kamm 2003] T. M. Kamm and G. G. L. Meyer, Word-Selective Training for Speech Recognition, in Proc. IEEE Workshop Automatic Speech Recognition and Understanding, December 2003.

[Kamm 2002] T. M. Kamm and G. G. L. Meyer, Selective Sampling of Training Data for Speech Recognition, in Proc. Human Language Technology, March 2002.

[Kamm 2001] T. M. Kamm and G. G. L. Meyer, Automatic Selection of Transcribed Training Material, in Proc. IEEE Workshop Automatic Speech Recognition and Understanding, December 2001.

[Martin 1997] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki, The DET Curve in Assessment of Detection Task Performance, Proceedings of the 5th European Conference on Speech Communication and Technology, vol. 4, pp. 1895-1898, Rhodes, Greece, September 1997. |