Bill Byrne's Publications: Articles followed by conference papers

[1] V. Venkataramani, S. Chakrabartty, and W. Byrne. GiniSupport vector machines for segmental minimum Bayes risk decoding of continuous speech. Computer Speech and Language. Submitted.
[ bib | .pdf ]

We describe the use of Support Vector Machines (SVMs) for continuous speech recognition by incorporating them in Segmental Minimum Bayes Risk decoding. Lattice cutting is used to convert the Automatic Speech Recognition search space into sequences of smaller recognition problems. SVMs are then trained as discriminative models over each of these problems and used in a rescoring framework. We pose the estimation of a posterior distribution over hypotheses in these regions of acoustic confusion as a logistic regression problem. We also show that GiniSVMs can be used as an approximation technique to estimate the parameters of the logistic regression problem. On a small vocabulary recognition task we show that the use of GiniSVMs can improve the performance of a well trained Hidden Markov Model system trained under the Maximum Mutual Information criterion. We also find that it is possible to derive reliable confidence scores over the GiniSVM hypotheses and that these can be used to good effect in hypothesis combination. We discuss the problems that we expect to encounter in extending this approach to Large Vocabulary Continuous Speech Recognition and describe initial investigation of constrained estimation techniques to derive feature spaces for SVMs.
[2] S. Kumar and W. Byrne. A weighted finite state transducer translation template model for statistical machine translation. Journal of Natural Language Engineering. Submitted.
[ bib ]

We present a Weighted Finite State Transducer Translation Template Model for statistical machine translation. The approach we describe allows us to implement each constituent distribution of the model as a weighted finite state transducer or acceptor. We show that bitext word alignment and translation under the model can be performed with standard FSM operations involving these transducers. One of the benefits of using this framework is that it avoids the need to develop specialized search procedures, even for the generation of lattices or N-Best lists of bitext word alignments and translation hypotheses. We report and analyze bitext word alignment and translation performance of the model on French-English and Chinese-English tasks.
[3] V. Doumpiotis, S. Tsakalidis, and W. Byrne. Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. IEEE Transactions on Speech and Audio Processing. Accepted, to appear.
[ bib | .pdf ]

Linear transforms have been used extensively for training and adaptation of HMM-based ASR systems. Recently procedures have been developed for the estimation of linear transforms under the Maximum Mutual Information (MMI) criterion. In this paper we introduce discriminative training procedures that employ linear transforms for feature normalization and for speaker adaptive training. We integrate these discriminative linear transforms into MMI estimation of HMM parameters for improvement of large vocabulary conversational speech recognition systems.
[4] Vlasios Doumpiotis and William Byrne. Lattice segmentation and minimum Bayes risk discriminative training for large vocabulary continuous speech recognition. Speech Communication. Submitted.
[ bib | .pdf ]

Lattice segmentation techniques developed for Minimum Bayes Risk decoding in large vocabulary speech recognition tasks are used to compute the statistics for discriminative training algorithms that estimate HMM parameters so as to reduce the overall risk over the training data. New estimation procedures are developed and evaluated for small vocabulary and large vocabulary recognition tasks, and additive performance improvements are shown relative to maximum mutual information estimation. These relative gains are explained through a detailed analysis of individual word recognition errors.
[5] V. Goel, S. Kumar, and W. Byrne. Segmental minimum Bayes-risk decoding for automatic speech recognition. IEEE Transactions on Speech and Audio Processing, May 2004.
[ bib | http ]

Minimum Bayes-Risk (MBR) speech recognizers have been shown to yield improvements over the conventional maximum a-posteriori probability (MAP) decoders through N-best list rescoring and A-star search over word lattices. We present a Segmental Minimum Bayes-Risk decoding (SMBR) framework that simplifies the implementation of MBR recognizers through the segmentation of the N-best lists or lattices over which the recognition is to be performed. This paper presents lattice cutting procedures that underlie SMBR decoding. Two of these procedures are based on a risk minimization criterion while a third one is guided by word-level confidence scores. In conjunction with SMBR decoding, these lattice segmentation procedures give consistent improvements in recognition word error rate (WER) on the Switchboard corpus. We also discuss an application of risk-based lattice cutting to multiple-system SMBR decoding and show that it is related to other system combination techniques such as ROVER. This strategy combines lattices produced from multiple ASR systems and is found to give WER improvements in a Switchboard evaluation system.
[6] W. Byrne, D. Doermann, M. Franz, S. Gustman, J. Hajic, D. Oard, M. Picheny, J. Psutka, B. Ramabhadran, D. Soergel, T. Ward, and W.-J. Zhu. Automatic recognition of spontaneous speech for access to multilingual oral history archives. IEEE Transactions on Speech and Audio Processing, Special Issue on Spontaneous Speech Processing, July 2004.
[ bib ]

The MALACH project has the goal of developing the technologies needed to facilitate access to large collections of spontaneous speech. Its aim is to dramatically improve the state of the art in key Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) technologies for use in large-scale retrieval systems. The project leverages a unique collection of oral history interviews with survivors of the Holocaust that has been assembled and extensively annotated by the Survivors of the Shoah Visual History Foundation. This paper describes the collection, 116,000 hours of interviews in 32 languages, and the way in which system requirements have been discerned through user studies. It discusses ASR methods for very difficult speech (heavily accented, emotional, and elderly spontaneous speech), including transcription to create training data and methods for language modeling and speaker adaptation. Results are presented for English and Czech. NLP results are presented for named entity tagging, topic segmentation, and supervised topic classification, and the architecture of an integrated search system that uses these results is described.
[7] V. Goel and W. Byrne. Minimum Bayes-risk automatic speech recognition. In W. Chou and B.-H. Juang, editors, Pattern Recognition in Speech and Language Processing. CRC Press, 2003.
[ bib ]

[8] F. Zheng, Z. Song, P. Fung, and W. Byrne. Mandarin pronunciation modeling based on the CASS corpus. Journal of Computer Science and Technology (Science Press, Beijing, China), 17(3), May 2002.
[ bib | .pdf ]

Pronunciation variability is an important issue that must be faced when developing practical automatic spontaneous speech recognition systems. In this paper, the factors that may affect the recognition performance are analyzed, including those specific to the Chinese language. By studying the INITIAL/FINAL (IF) characteristics of the Chinese language and developing the Bayesian equation, we propose the concepts of generalized INITIAL/FINAL (GIF) and generalized syllable (GS), the GIF modeling and the IF-GIF modeling, as well as the context-dependent pronunciation weighting, based on a well phonetically transcribed seed database. By using these methods, the Chinese syllable error rate (SER) was reduced by 6.3% and 4.2% compared with the GIF modeling and IF modeling respectively when the language model, such as syllable or word N-gram, is not used. The effectiveness of these methods is also demonstrated when more data without the phonetic transcription is used to refine the acoustic model using the proposed iterative force-alignment based transcribing (IFABT) method, achieving a 5.7% SER reduction.
[9] A. Gunawardana and W. Byrne. Discounted likelihood linear regression for rapid speaker adaptation. Computer Speech and Language, 15(1):15-38, Jan 2001.
[ bib ]

The widely used maximum likelihood linear regression speaker adaptation procedure suffers from overtraining when used for rapid adaptation tasks in which the amount of adaptation data is severely limited. This is a well known difficulty associated with the expectation maximization algorithm. We use an information geometric analysis of the expectation maximization algorithm as an alternating minimization of a Kullback-Leibler-type divergence to see the cause of this difficulty, and propose a more robust discounted likelihood estimation procedure. This gives rise to a discounted likelihood linear regression procedure, which is a variant of maximum likelihood linear regression suited for small adaptation sets. Our procedure is evaluated on an unsupervised rapid adaptation task defined on the Switchboard conversational telephone speech corpus, where our proposed procedure improves word error rate by 1.6% (absolute) with as little as five seconds of adaptation data, which is a situation in which maximum likelihood linear regression overtrains in the first iteration of adaptation. We compare several realizations of discounted likelihood linear regression with maximum likelihood linear regression and other simple maximum likelihood linear regression variants, and discuss issues that arise in implementing our discounted likelihood procedures.
[10] V. Goel and W. Byrne. Minimum Bayes-Risk automatic speech recognition. Computer Speech and Language, 14(2):115-135, 2000.
[ bib ]

In this paper we address the problem of efficient implementation of the minimum Bayes-risk classifiers for automatic speech recognition. Simplifying assumptions that allow computationally feasible approximations to these classifiers are proposed. Under these assumptions an approximate implementation as an A-star search algorithm over recognition lattices is constructed. This algorithm improves upon the previously proposed N-best list rescoring implementation of these classifiers. The minimum Bayes-risk classifiers are shown to outperform the most commonly used maximum a-posteriori probability (MAP) classifier on three speech recognition tasks: reduction of word error rate, reduction of content word error rate, and identification of Named Entities in speech. The A-star implementation is also contrasted with the N-best list rescoring implementation and is found to obtain modest but significant improvements in accuracy with little computational overhead.
[11] W. Byrne and A. Gunawardana. Comments on 'Efficient training algorithms for HMM's using incremental estimation'. IEEE Transactions on Speech and Audio Processing, 8(6):751-754, Nov 2000.
[ bib ]

"Efficient Training Algorithms for HMM's using Incremental Estimation" investigates EM procedures that increase training speed. The authors' claim that these are GEM procedures is incorrect. We discuss why this is so, provide an example of non-monotonic convergence to a local maximum in likelihood, and outline conditions that guarantee such convergence.
[12] M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, and G. Zavaliagkos. Stochastic pronunciation modeling from hand-labelled phonetic corpora. Speech Communication, November 1999.
[ bib ]

In the early '90s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech ASR task. More recently, the ICSI spontaneous-speech phonetically transcribed corpus was collected at the behest of the 1996 and 1997 LVCSR Summer Workshops held at Johns Hopkins University. A 1997 workshop (WS97) group focused on pronunciation inference from this corpus for application to the DoD Switchboard spontaneous telephone speech ASR task. We describe several approaches taken there. These include (1) one analogous to the AT&T approach, (2) one, inspired by work at WS96 and CMU, that involved adding pronunciation variants of a sequence of one or more words ('multiwords') in the corpus (with corpus-derived probabilities) into the ASR lexicon, and (1+2) a hybrid approach in which a decision-tree model was used to automatically phonetically transcribe a much larger speech corpus than ICSI and then the multiword approach was used to construct an ASR recognition pronunciation lexicon.
[13] W. Byrne and S. Shamma. Neurocontrol in sequence recognition. In O. Omidvar and D. Elliott, editors, Progress in Neural Networks: Neural Networks for Control. Academic Press, 1997.
[ bib | .pdf.gz ]

An artificial neural network intended for sequence modeling and recognition is described. The network is based on a lateral inhibitory network with controlled, oscillatory behavior so that it naturally models sequence generation. Dynamic programming algorithms can be used to transform the network into a sequence recognizer. Markov decision theory is used to develop novel and more "neural" recognition control strategies as alternatives to dynamic programming.
[14] S. Young, P. Woodland, and W. Byrne. Spontaneous speech recognition for the credit card corpus using the HTK toolkit. IEEE Transactions on Speech and Audio Processing, 1994.
[ bib ]

This paper describes the speech recognition system which was provided as a baseline for the Summer Workshop on Robust Speech Processing held at the Rutgers CAIP Center in July/August 1993.
[15] W. Byrne. Alternating Minimization and Boltzmann Machine learning. IEEE Transactions on Neural Networks, 3(4):612-620, 1992.
[ bib ]

Training a Boltzmann machine with hidden units is appropriately treated in information geometry using the information divergence and the technique of alternating minimization. The resulting algorithm is shown to be closely related to gradient descent Boltzmann machine learning rules, and the close relationship of both to the EM algorithm is described. An iterative proportional fitting procedure is described and incorporated into the alternating minimization algorithm.
[16] W. Byrne, R. Zapp, P. Flynn, and M. Siegel. Adaptive filter processing in remote heart monitors. IEEE Transactions on Biomedical Engineering, July 1986.
[ bib ]

--- Conference Papers Follow ---
[17] I. Shafran and W. Byrne. Task-specific minimum Bayes-risk decoding using learned edit distance. In Proc. of the International Conference on Spoken Language Processing, 2004.
[ bib | .pdf ]

This paper extends the minimum Bayes-risk framework to incorporate a loss function specific to the task and the ASR system. The errors are modeled as a noisy channel and the parameters are learned from the data. The resulting loss function is used in the risk criterion for decoding. Experiments on a large vocabulary conversational speech recognition system demonstrate significant gains of about 1 over an untrained loss function. The approach is general enough to be applicable to other sequence recognition problems such as Optical Character Recognition (OCR) and the analysis of biological sequences.
[18] J. Psutka, P. Ircing, J. Hajic, V. Radova, J.V. Psutka, W. Byrne, and S. Gustman. Issues in annotation of the Czech spontaneous speech corpus in the MALACH project. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2004.
[ bib | .ps ]

This paper presents the issues encountered in processing spontaneous Czech speech in the MALACH project. Specific problems connected with the frequent occurrence of colloquial words in spontaneous Czech are analyzed; a partial solution is proposed and experimentally evaluated.
[19] J. Psutka, J. Hajic, and W. Byrne. Slavic languages in the MALACH project. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 2004. Invited Paper in Special Session on Multilingual Speech Processing.
[ bib | .pdf ]

The development of acoustic training material for Slavic languages within the MALACH project is described. Initial experience with the variety of speakers and the difficulties encountered in transcribing Czech, Slovak, and Russian language oral history are described along with ASR recognition results intended to investigate the effectiveness of different transcription conventions that address language specific phenomena within the task domain.
[20] S. Kumar and W. Byrne. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of HLT-NAACL, 2004.
[ bib | .pdf ]

We present Minimum Bayes-Risk (MBR) decoding for statistical machine translation. This statistical approach aims to minimize expected loss of translation errors under loss functions that measure translation performance. We describe a hierarchy of loss functions that incorporate different levels of linguistic information from word strings, word-to-word alignments from an MT system, and syntactic structure from parse-trees of source and target language sentences. We report the performance of the MBR decoders on a Chinese-to-English translation task. Our results show that MBR decoding can be used to tune statistical MT performance for specific loss functions.
[21] V. Doumpiotis and W. Byrne. Pinched lattice minimum Bayes risk discriminative training for large vocabulary continuous speech recognition. In Proc. of the International Conference on Spoken Language Processing, 2004.
[ bib | .pdf ]

Iterative estimation procedures that minimize empirical risk based on general loss functions such as the Levenshtein distance have been derived as extensions of the Extended Baum Welch algorithm. While reducing expected loss on training data is a desirable training criterion, these algorithms can be difficult to apply. They are unlike MMI estimation in that they require an explicit listing of the hypotheses to be considered and in complex problems such lists tend to be prohibitively large. To overcome this difficulty, modeling techniques originally developed to improve search efficiency in Minimum Bayes Risk decoding can be used to transform these estimation algorithms so that exact update, risk minimization procedures can be used for complex recognition problems. Experimental results in two large vocabulary speech recognition tasks show improvements over conventionally trained MMIE models.
[22] V. Venkataramani, S. Chakrabartty, and W. Byrne. Support vector machines for segmental minimum Bayes risk decoding of continuous speech. In IEEE Automatic Speech Recognition and Understanding Workshop, 2003.
[ bib | .pdf ]

Segmental Minimum Bayes Risk (SMBR) Decoding involves the refinement of the search space into manageable confusion sets, i.e., smaller sets of confusable words. We describe the application of Support Vector Machines (SVMs) as discriminative models for the refined search space. We show that SVMs, which in their basic formulation are binary classifiers of fixed dimensional observations, can be used for continuous speech recognition. We also study the use of GiniSVMs, which are a variant of the basic SVM. On a small vocabulary task, we show this two pass scheme outperforms MMI trained HMMs. Using system combination we also obtain further improvements over discriminatively trained HMMs.
[23] J. Psutka, P. Ircing, J.V. Psutka, V. Radova, W. Byrne, J. Hajic, J. Mirovsky, and S. Gustman. Large vocabulary ASR for spontaneous Czech in the MALACH project. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2003.
[ bib | .pdf ]

This paper describes LVCSR research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. This project attempts to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the problems in language modeling that arise from the nature of Czech as a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The difficulties of this task are compounded by heavily accented, emotional and disfluent speech along with frequent switching between languages. To overcome the limited amount of relevant language model data we use statistical techniques for selecting an appropriate training corpus from a large unstructured text collection resulting in significant reductions in word error rate. recognition and retrieval techniques to improve cataloging efficiency and eventually to provide direct access into the archive itself.
[24] J. Psutka, P. Ircing, J. V. Psutka, V. Radova, W. Byrne, J. Hajic, and S. Gustman. Towards automatic transcription of spontaneous Czech speech in the MALACH project. In Proceedings of the Text, Speech, and Dialog Workshop, 2003.
[ bib | .pdf ]

Our paper discusses the progress achieved during a one-year effort in building the Czech LVCSR system for the automatic transcription of spontaneously produced testimonies in the MALACH project. The difficulty of this task stems from the highly inflectional nature of the Czech language and is further multiplied by the presence of many colloquial words in spontaneous Czech speech as well as by the need to handle emotional speech filled with disfluencies, heavy accents, age-related coarticulation and language switching. In this paper we concentrate mainly on the acoustic modeling issues - the proper choice of front-end parameterization, the handling of non-speech events in acoustic modeling, and unsupervised acoustic adaptation via MLLR. A method for selecting suitable language modeling data is also briefly discussed.
[25] J. Psutka, I. Iljuchin, P. Ircing, J.V. Psutka, V. Trejbal, W. Byrne, J. Hajic, and S. Gustman. Building LVCSR systems for transcription of spontaneously produced Russian witnesses in the MALACH project: initial steps and first results. In Proceedings of the Text, Speech, and Dialog Workshop, 2003.
[ bib ]

The MALACH project uses the world's largest digital archive of video oral histories collected by the Survivors of the Shoah Visual History Foundation (VHF) and attempts to access such archives by advancing the state-of-the-art in Automatic Speech Recognition and Information Retrieval. This paper discusses the initial steps and first results in building large vocabulary continuous speech recognition (LVCSR) systems for the transcription of Russian witnesses. As the third language processed in the MALACH project (following English and Czech), Russian has posed new ASR challenges, especially in phonetic modeling. Although most of the Russian testimonies were provided by native Russian survivors, the speakers come from many different regions and countries resulting in a diverse collection of accented spontaneous Russian speech.
[26] D. Oard, D. Doermann, B. Dorr, D. He, P. Resnik, W. Byrne, S. Khudanpur, D. Yarowsky, A. Leuski, P. Koehn, and K. Knight. Desperately seeking Cebuano. In Proceedings of HLT-NAACL, 2003.
[ bib | .pdf ]

This paper describes an effort to rapidly develop language resources and component technology to support searching Cebuano news stories using English queries. Results from the first 60 hours of the exercise are presented.
[27] S. Kumar and W. Byrne. A weighted finite state transducer implementation of the alignment template model for statistical machine translation. In Proceedings of HLT-NAACL, 2003.
[ bib | .pdf ]

We present a derivation of the alignment template model for statistical machine translation and an implementation of the model using weighted finite state transducers. The approach we describe allows us to implement each constituent distribution of the model as a weighted finite state transducer or acceptor. We show that bitext word alignment and translation under the model can be performed with standard FSM operations involving these transducers. One of the benefits of using this framework is that it obviates the need to develop specialized search procedures, even for the generation of lattices or N-Best lists of bitext word alignments and translation hypotheses. We evaluate the implementation of the model on the French-to-English Hansards task and report alignment and translation performance.
[28] O. Kolak, W. Byrne, and P. Resnik. A generative probabilistic OCR model for NLP applications. In Proceedings of HLT-NAACL, 2003.
[ bib | .pdf ]

In this paper we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make them more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model's ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.
[29] A. Ikeno, B. Pellom, D. Cer, A. Thornton, J. M. Brenier, D. Jurafsky, W. Ward, and W. Byrne. Issues in recognition of Spanish-accented spontaneous English. In Proceedings of the ISCA and IEEE workshop on Spontaneous Speech Processing and Recognition, Tokyo Institute of Technology, Tokyo, Japan, 2003. ISCA and IEEE.
[ bib | .pdf ]

We describe a recognition experiment and two analytic experiments on a database of strongly Hispanic-accented English. We show the crucial importance of training on the Hispanic-accented data for acoustic model performance, and describe the tendency of Spanish-accented speakers to use longer, and presumably less-reduced, schwa vowels than native-English speakers.
[30] V. Doumpiotis, S. Tsakalidis, and W. Byrne. Discriminative training for segmental minimum Bayes-risk decoding. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 2003.
[ bib | .pdf ]

A modeling approach is presented that incorporates discriminative training procedures within segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. Acoustic models specialized to discriminate between the competing words in these classes are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models.
[31] V. Doumpiotis, S. Tsakalidis, and W. Byrne. Lattice Segmentation and Minimum Bayes Risk Discriminative Training. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2003.
[ bib | .pdf ]

Modeling approaches are presented that incorporate discriminative training procedures in segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. We discuss two approaches to incorporating these segmented lattices in discriminative training. We investigate the use of acoustic models specialized to discriminate between the competing words in these classes which are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models.
[32] W. Byrne, S. Khudanpur, W. Kim, S. Kumar, P. Pecina, P. Virga, P. Xu, and D. Yarowsky. The Johns Hopkins University 2003 Chinese-English Machine Translation System. In Machine Translation Summit IX. The Association for Machine Translation in the Americas, 2003.
[ bib | .pdf ]

We describe a Chinese to English Machine Translation system developed at the Johns Hopkins University for the NIST 2003 MT evaluations. The system is based on a Weighted Finite State Transducer implementation of the alignment template translation model for statistical machine translation. The baseline MT system was trained using 100,000 sentence pairs selected from a static bitext training collection. Information retrieval techniques were then used to create specific training collections for each document to be translated. This document-specific training set included bitext and named entities that were then added to the baseline system by augmenting the library of alignment templates. We report translation performance of baseline and IR-based systems on two NIST MT evaluation test sets.
[33] W. Ward, H. Krech, X. Yu, K. Herold, G. Figgs, A. Ikeno, D. Jurafsky, and W. Byrne. Lexicon adaptation for LVCSR: speaker idiosyncracies, non-native speakers, and pronunciation choice. In ISCA ITR Workshop on Pronunciation Modeling and Lexicon Adaptation, 2002.
[ bib | .pdf ]

We report on our preliminary experiments on building dynamic lexicons for native-speaker conversational speech and for foreign-accented conversational speech. Our goal is to build a lexicon with a set of pronunciations for each word, in which the probability distribution over pronunciations is dynamically computed. The set of pronunciations is derived from hand-written rules (for foreign accent) or clustering (for phonetically-transcribed Switchboard data). The dynamic pronunciation-probability will take into account specific characteristics of the speaker as well as factors such as language-model probability, disfluencies, sentence position, and phonetic context.
[34] S. Tsakalidis, V. Doumpiotis, and W. Byrne. Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. In Proc. of the International Conference on Spoken Language Processing, Denver, Colorado, USA, 2002.
[ bib | .pdf ]

Linear transforms have been used extensively for training and adaptation of HMM-based ASR systems. Recently procedures have been developed for the estimation of linear transforms under the Maximum Mutual Information (MMI) criterion. In this paper we introduce discriminative training procedures that employ linear transforms for feature normalization and for speaker adaptive training. We integrate these discriminative linear transforms into MMI estimation of HMM parameters for improvement of large vocabulary conversational speech recognition systems.
[35] J. Psutka, P. Ircing, J. Psutka, V. Radova, W. Byrne, J. Hajic, S. Gustman, and B. Ramabhadran. Automatic transcription of Czech language oral history in the MALACH project: Resources and initial experiments. In Proceedings of the Text, Speech, and Dialog Workshop, 2002.
[ bib | .pdf ]

In this paper we describe the initial stages of the ASR component of the MALACH project. This project will attempt to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation by advancing the state of the art in automated speech recognition. In order to train the ASR system, it is necessary to manually transcribe a large amount of speech data, identify the appropriate vocabulary, and obtain relevant text for language modeling. We give a detailed description of the speech annotation process; show the specific properties of the spontaneous speech contained in the archives; and present baseline speech recognition results.
[36] D. Oard, D. Demner-Fushman, J. Hajic, B. Ramabhadran, S. Gustman, W. Byrne, D. Soergel, B. Dorr, P. Resnik, and M. Picheny. Cross-language access to recorded speech in the MALACH project. In Proceedings of the Text, Speech, and Dialog Workshop, 2002.
[ bib | .pdf ]

The MALACH project seeks to help users find information in a vast multilingual collection of untranscribed oral history interviews. This paper introduces the goals of the project and focuses on supporting access by users who are unfamiliar with the interview language. It begins with a review of the state of the art in cross-language speech retrieval; approaches that will be investigated in the project are then described. Czech was selected as the first non-English language to be supported; results of an initial experiment with Czech/English cross-language retrieval are reported.
[37] S. Kumar and W. Byrne. Risk based lattice cutting for segmental minimum Bayes-risk decoding. In Proc. of the International Conference on Spoken Language Processing, Denver, Colorado, USA, 2002.
[ bib | .ps ]

Minimum Bayes-Risk (MBR) speech recognizers have been shown to give improvements over the conventional maximum a-posteriori probability (MAP) decoders through N-best list rescoring and A-star search over word lattices. Segmental MBR (SMBR) decoders simplify the implementation of MBR recognizers by segmenting the N-best lists or lattices over which the recognition is performed. We present a lattice cutting procedure that attempts to minimize the total Bayes-Risk of all word strings in the segmented lattice. We provide experimental results on the Switchboard conversational speech corpus showing that this segmentation procedure, in conjunction with SMBR decoding, gives modest but significant improvements over MAP decoders as well as MBR decoders on unsegmented lattices.
[38] S. Kumar and W. Byrne. Minimum Bayes-risk alignment of bilingual texts. In Proc. of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, USA, 2002.
[ bib | .ps ]

We present Minimum Bayes-Risk word alignment for machine translation. This statistical, model-based approach attempts to minimize the expected risk of alignment errors under loss functions that measure alignment quality. We describe various loss functions, including some that incorporate linguistic analysis as can be obtained from parse trees, and show that these approaches can improve alignments of the English-French Hansards.
[39] S. Gustman, D. Soergel, D. Oard, W. Byrne, M. Picheny, B. Ramabhadran, and D. Greenberg. Supporting access to large digital oral history archives. In Proceedings of the Joint Conference on Digital Libraries, 2002.
[ bib | .pdf ]

This paper describes our experience with the creation, indexing, and provision of access to a very large archive of videotaped oral histories - 116,000 hours of digitized interviews in 32 languages from 52,000 survivors, liberators, rescuers, and witnesses of the Nazi Holocaust. It goes on to identify a set of critical research issues that must be addressed if we are to provide full and detailed access to collections of this size: issues in user requirement studies, automatic speech recognition, automatic classification, segmentation, summarization, retrieval, and user interfaces. The paper ends by inviting others to discuss use of these materials in their own research.
[40] F. Zheng, Z. Song, P. Fung, and W. Byrne. Modeling pronunciation variation using context-dependent weighting and B/S refined acoustic modeling. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2001.
[ bib ]

Pronunciation variability is an important issue that must be faced when developing practical automatic spontaneous speech recognition systems. By studying the initial/final (IF) characteristics of the Chinese language and developing the Bayesian equation, we propose the concepts of generalized initial/final (GIF) and generalized syllable (GS), the GIF modeling method and the IF-GIF modeling method, as well as the context-dependent pronunciation weighting method. By using these approaches, the IF-GIF modeling reduces the Chinese syllable error rate (SER) by 6.3% and 4.2% compared with the GIF modeling and IF modeling respectively when language modeling, such as a syllable or word N-gram, is not used.
[41] V. Venkataramani and W. Byrne. MLLR adaptation techniques for pronunciation modeling. In IEEE Workshop on Automatic Speech Recognition and Understanding, Madonna di Campiglio, Italy, 2001.
[ bib | .pdf ]

Multiple regression class MLLR transforms are investigated for use with pronunciation models that predict variation in the observed pronunciations given the phonetic context. Regression classes can be constructed so that MLLR transforms can be estimated and used to model specific acoustic changes associated with pronunciation variation. The effectiveness of this modeling approach is evaluated on the phonetically transcribed portion of the SWITCHBOARD conversational speech corpus.
[42] P. Ircing, P. Krebc, J. Hajic, S. Khudanpur, F. Jelinek, J. Psutka, and W. Byrne. On large vocabulary continuous speech recognition of highly inflectional language - Czech. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2001.
[ bib ]

[43] A. Gunawardana and W. Byrne. Convergence of DLLR rapid speaker adaptation algorithms. In ISCA ITR-Workshop on Adaptation Methods for Automatic Speech Recognition, 2001.
[ bib | .pdf ]

Discounted Likelihood Linear Regression (DLLR) is a speaker adaptation technique for cases where there is insufficient data for MLLR adaptation. Here, we provide an alternative derivation of DLLR by using a censored EM formulation which postulates additional adaptation data which is hidden. This derivation shows that DLLR, if allowed to converge, provides maximum likelihood solutions. Thus the robustness of DLLR to small amounts of data is obtained by slowing down the convergence of the algorithm and by allowing termination of the algorithm before overtraining occurs. We then show that discounting the observed adaptation data by postulating additional hidden data can also be extended to MAP estimation of MLLR-type adaptation transformations.
[44] A. Gunawardana and W. Byrne. Discriminative speaker adaptation with conditional maximum likelihood linear regression. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2001.
[ bib | .pdf ]

We present a simplified derivation of the extended Baum-Welch procedure, which shows that it can be used for Maximum Mutual Information (MMI) estimation of a large class of continuous emission density hidden Markov models (HMMs). We use the extended Baum-Welch procedure for discriminative estimation of MLLR-type speaker adaptation transformations. The resulting adaptation procedure, termed Conditional Maximum Likelihood Linear Regression (CMLLR), is used successfully for supervised and unsupervised adaptation tasks on the Switchboard corpus, yielding an improvement over MLLR. The interaction of unsupervised CMLLR with segmental minimum Bayes risk lattice voting procedures is also explored, showing that the two procedures are complementary.
[45] V. Goel, S. Kumar, and W. Byrne. Confidence based lattice segmentation and minimum Bayes-risk decoding. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), volume 4, pages 2569-2572, Aalborg, Denmark, 2001.
[ bib | .pdf ]

Minimum Bayes Risk (MBR) speech recognizers have been shown to yield improvements over the conventional maximum a-posteriori probability (MAP) decoders in the context of N-best list rescoring and A-star search over recognition lattices. Segmental MBR (SMBR) procedures have been developed to simplify implementation of MBR recognizers, by segmenting the N-best list or lattice, to reduce the size of the search space over which MBR recognition is carried out. In this paper we describe lattice cutting as a method to segment recognition word lattices into regions of low confidence and high confidence. We present two SMBR decoding procedures that can be applied on low confidence segment sets. Results obtained on the Switchboard conversational telephone speech corpus show modest but significant improvements relative to MAP decoders.
[46] W. Byrne, V. Venkataramani, T. Kamm, T.F. Zheng, Z. Song, P. Fung, Y. Liu, and U. Ruhi. Automatic generation of pronunciation lexicons for Mandarin casual speech. In IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 569-572, Salt Lake City, Utah, 2001. IEEE.
[ bib | .pdf ]

Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed in English are applied to this corpus to train pronunciation models which are then applied in Mandarin Broadcast News transcription.
[47] D. Vergyri, S. Tsakalidis, and W. Byrne. Minimum risk acoustic clustering for multilingual acoustic model combination. In International Conference on Spoken Language Processing, 2000.
[ bib | .pdf ]

In this paper we describe procedures for combining multiple acoustic models, obtained using training corpora from different languages, in order to improve ASR performance in languages for which large amounts of training data are not available. We treat these models as multiple sources of information whose scores are combined in a log-linear model to compute the hypothesis likelihood. The model combination can either be performed in a static way, with constant combination weights, or in a dynamic way, with parameters that can vary for different segments of a hypothesis. The aim is to optimize the parameters so as to achieve minimum word error rate. In order to achieve robust parameter estimation in the dynamic combination case, the parameters are defined to be piecewise constant on different phonetic classes that form a partition of the space of hypothesis segments. The partition is defined, using phonological knowledge, on segments that correspond to hypothesized phones. We examine different ways to define such a partition, including an automatic approach that gives a binary tree structured partition which tries to achieve the minimum WER with the minimum number of classes.
[48] J. McDonough and W. Byrne. On the incremental addition of regression classes for speaker adaptation. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 2000.
[ bib ]

[49] A. Gunawardana and W. Byrne. Robust estimation for rapid adaptation using discounted likelihood techniques. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2000.
[ bib | .pdf ]

The discounted likelihood procedure, which is a robust extension of the usual EM procedure, is presented, and two approximations which lead to two different variants of the usual MLLR adaptation scheme are introduced. These schemes are shown to robustly estimate speaker adaptation transforms with very little data. The evaluation is carried out on the Switchboard corpus.
[50] V. Goel, S. Kumar, and W. Byrne. Segmental minimum Bayes-risk ASR voting strategies. In Proc. of the International Conference on Spoken Language Processing, volume 3, pages 139-142, Beijing, China, 2000.
[ bib | .pdf ]

ROVER and its successor voting procedures have been shown to be quite effective in reducing the recognition word error rate (WER). The success of these methods has been attributed to their minimum Bayes-risk (MBR) nature: they produce the hypothesis with the least expected word error. In this paper we develop a general procedure within the MBR framework, called segmental MBR recognition, that encompasses current voting techniques and allows further extensions that yield lower expected WER. It also allows incorporation of loss functions other than the WER. We present a derivation of the voting procedure of N-best ROVER as an instance of segmental MBR recognition. We then present an extension, called e-ROVER, that alleviates some of the restrictions of N-best ROVER by better approximating the WER. e-ROVER is compared with N-best ROVER on a multi-lingual acoustic modeling task and is shown to yield modest yet significant and easily obtained improvements.
[51] W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and W. Wang. Towards language independent acoustic modeling. In IEEE Conference on Acoustics, Speech and Signal Processing, pages 1029-1032, Istanbul, Turkey, 2000. IEEE.
[ bib | .pdf ]

We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge-based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.
[52] A. Li, F. Zheng, W. Byrne, P. Fung, T. Kamm, Y. Liu, Z. Song, U. Ruhi, V. Venkataramani, and X. Chen. CASS: A phonetically transcribed corpus of Mandarin spontaneous speech. In Proc. of the International Conference on Spoken Language Processing, 2000.
[ bib | .pdf ]

A collection of Chinese spoken language has been assembled and phonetically annotated to capture spontaneous speech and language effects. The Chinese Annotated Spontaneous Speech (CASS) corpus contains phonetically transcribed spontaneous speech. This corpus was created to begin to collect samples of most of the phonetic variations in Mandarin spontaneous speech due to pronunciation effects, including allophonic changes, phoneme reduction, phoneme deletion and insertion, as well as duration changes. It is intended for use in pronunciation modeling for improved automatic speech recognition and will be used at the 2000 Johns Hopkins University Language Engineering Workshop by the project on Pronunciation Modeling of Mandarin Casual Speech.
[53] J. McDonough and W. Byrne. Speaker adaptation with all-pass transforms. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1999.
[ bib | .pdf ]

In recent work, a class of transforms was proposed which achieve a remapping of the frequency axis much like conventional vocal tract length normalization. These mappings, known collectively as all-pass transforms (APT), were shown to produce substantial improvements in the performance of a large vocabulary speech recognition system when used to normalize incoming speech prior to recognition. In this application, the most advantageous characteristic of the APT was its cepstral-domain linearity; this linearity makes speaker normalization simple to implement, and provides for the robust estimation of the parameters characterizing individual speakers. In the current work, we exploit the APT to develop a speaker adaptation scheme in which the cepstral means of a speech recognition model are transformed to better match the speech of a given speaker. In a set of speech recognition experiments conducted on the Switchboard Corpus, we report reductions in word error rate of 3.7% absolute.
[54] J. McDonough and W. Byrne. Single-pass adapted training with all-pass transforms. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 1999.
[ bib | .pdf ]

In recent work, the all-pass transform (APT) was proposed as the basis of a speaker adaptation scheme intended for use with a large vocabulary speech recognition system. It was shown that APT-based adaptation reduces to a linear transformation of cepstral means, much like the better known maximum likelihood linear regression (MLLR), but is specified by far fewer free parameters. Due to its linearity, APT-based adaptation can be used in conjunction with speaker-adapted training (SAT), an algorithm for performing maximum likelihood estimation of the parameters of an HMM when speaker adaptation is to be employed during both training and test. In this work, we propose a refinement of SAT called single-pass adapted training (SPAT) which achieves the same improvement in system performance as SAT but requires much less computation for HMM training. In a set of speech recognition experiments conducted on the Switchboard Corpus, we report a word error rate reduction of 5.3% absolute using a single, global APT.
[55] V. Goel and W. Byrne. Task dependent loss functions in speech recognition: Application to named entity extraction. In ESCA-ETR Workshop on accessing information in spoken audio, 1999.
[ bib ]

[56] V. Goel and W. Byrne. Task dependent loss functions in speech recognition: A-star search over recognition lattices. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 1999.
[ bib | .pdf ]

A recognition strategy that can be matched to specific system performance criteria has recently been found to yield improvements over the usual maximum a posteriori probability strategy. Some examples of different system performance criteria are word error rate (WER), F-measure for Named Entity extraction tasks, and word-specific errors for keyword spotting tasks. In the matched-to-the-task strategy the hypothesis is chosen to minimize the expected loss or the Bayes Risk under a loss function defined by the performance measure of interest. Due to the prohibitively expensive implementation of this strategy, only an approximate implementation as an N-best list rescoring scheme has been used so far. Our goal is to improve the performance of such risk-based decoders by developing search strategies that can incorporate more acoustic evidence. In this paper we present search algorithms to implement the risk-based recognition strategy over word lattices that contain acoustic and language model scores. These algorithms are extensions of the N-best list rescoring approximation and are formulated as A-star algorithms. We first present a single stack A-star search and show how to obtain an under-estimate and an over-estimate of the cost needed for the search. For loss functions that do not depend on time segmentation of hypotheses, a prefix-tree based simplification of the single stack algorithm is then derived. For yet a further subset of loss functions, including the usual Levenshtein distance based loss for WER reduction tasks, we describe a search organization that facilitates further efficiencies in computation and storage. Finally we present a path equivalence criterion for merging of prefix tree nodes during search to allow for a larger search space. We find that restricted loss functions yield the most efficient search procedures. However, the general single stack search can be applied quite broadly, even in principle to loss functions that measure semantic agreement between sentences. Preliminary experiments were performed for a WER reduction task on the Switchboard corpus, dev-test set of the 1997 JHU-LVCSR workshop. We obtain an error rate reduction of 0.8-0.9% absolute over a baseline of 38.5% WER. The search speed is comparable to the N-best list rescoring procedure, which is much more restrictive in the number of hypotheses considered for search and produces slightly inferior results (0.5-0.6% absolute improvement). At the conference we will present the framework of task dependent recognition strategy, its implementation as A-star search, and the speed and accuracy comparison of the search with the N-best list rescoring procedure.
[57] V. Digalakis, S. Berkowitz, E. Bochieri, C. Boulis, W. Byrne, H. Collier, A. Corduneanu, A. Kannan, S. Khudanpur, J. McDonough, and A. Sankar. Rapid speech recognizer adaptation to new speakers. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 1999.
[ bib | .pdf ]

This paper summarizes the work of the "Rapid Speech Recognizer Adaptation" team in the workshop held at Johns Hopkins University in the summer of 1998. The project addressed the modeling of dependencies between units of speech with the goal of making more effective use of small amounts of data for speaker adaptation. A variety of methods were investigated and their effectiveness in a rapid adaptation task defined on the SWITCHBOARD conversational speech corpus is reported.
[58] W. Byrne, J. Hajic, P. Ircing, F. Jelinek, S. Khudanpur, J. McDonough, N. Peterek, and J. Psutka. Large vocabulary speech recognition for read and broadcast Czech. In Proceedings of the Text, Speech, and Dialog Workshop, 1999.
[ bib | .pdf ]

We describe read speech and broadcast news corpora collected as part of a multi-year international collaboration for the development of large vocabulary speech recognition systems in the Czech language. Initial investigations into language modeling for Czech automatic speech recognition are described and preliminary recognition results on the read speech corpus are presented.
[59] W. Byrne and A. Gunawardana. Discounted likelihood linear regression for rapid adaptation. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 1999.
[ bib | .pdf ]

Rapid adaptation schemes that employ the EM algorithm may suffer from overtraining problems when used with small amounts of adaptation data. An algorithm to alleviate this problem is derived within the information geometric framework of Csiszár and Tusnády, and is used to improve MLLR adaptation on NAB and Switchboard adaptation tasks. It is shown how this algorithm approximately optimizes a discounted likelihood criterion.
[60] W. Byrne and A. Gunawardana. Convergence of EM variants. In IEEE Information Theory Workshop on Detection, Estimation, Classification, and Imaging, page 64, 1999.
[ bib | .pdf ]

[61] W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and W. Wang. Towards language independent acoustic modeling. In IEEE Workshop on Automatic Speech Recognition and Understanding, Keystone, Colorado, 1999.
[ bib | .pdf ]

We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge-based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.
[62] J. McDonough, W. Byrne, and X. Luo. Speaker normalization with all-pass transforms. In International Conference on Spoken Language Processing, 1998.
[ bib | .pdf ]

Speaker normalization is a process in which the short-time features of speech from a given speaker are transformed so as to better match some speaker independent model. Vocal tract length normalization (VTLN) is a popular speaker normalization scheme wherein the frequency axis of the short-time spectrum associated with a speaker's speech is rescaled or warped prior to the extraction of cepstral features. In this work, we develop a novel speaker normalization scheme by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We propose a class of such maps, designated all-pass transforms for reasons given hereafter, and in a set of speech recognition experiments conducted on the Switchboard Corpus demonstrate their capacity to achieve word error rate reductions of 3.7% absolute.
[63] V. Goel, W. Byrne, and S. Khudanpur. LVCSR rescoring with modified loss functions: a decision theoretic perspective. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1998.
[ bib | .pdf ]

In this work, the problem of speech decoding is viewed in a Decision Theoretic framework. A modified speech decoding procedure to minimize the expected word error rate is formulated in this framework, and its implementation in N-best list rescoring is presented. Preliminary experiments on the Switchboard corpus show small but statistically significant error rate improvements.
[64] W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. Stochastic pronunciation modeling from hand-labeled phonetic corpora. In Proceedings of the Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, 1998.
[ bib | .pdf ]

[65] W. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. Pronunciation modelling using a hand-labelled corpus for conversational speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 1998.
[ bib | .pdf ]

Accurately modelling pronunciation variability in conversational speech is an important component of an automatic speech recognition system. We describe some of the projects undertaken in this direction during and after WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July-August, 1997. We first illustrate a use of hand-labelled phonetic transcriptions of a portion of the Switchboard corpus, in conjunction with statistical techniques, to learn alternatives to canonical pronunciations of words. We then describe the use of these alternate pronunciations in an automatic speech recognition system. We demonstrate that the improvement in recognition performance from pronunciation modelling persists as the system is enhanced with better acoustic and language models.
[66] W. Byrne, S. Khudanpur, E. Knodt, and J. Bernstein. Is automatic speech recognition ready for non-native speech? A data collection effort and initial experiments in modeling conversational Hispanic English. In ESCA-ITR Workshop on speech technology in language learning, 1997.
[ bib | .pdf ]

We describe the protocol used for collecting a corpus of conversational English speech from non-native speakers at several levels of proficiency, and report the results of preliminary automatic speech recognition (ASR) experiments on this corpus using HTK-based ASR systems. The speech corpus contains both read and conversational speech recorded simultaneously on wide-band and telephone channels, and has detailed time aligned transcriptions. The immediate goal of the ASR experiments is to assess the difficulty of the ASR problem in language learning exercises and thus to gauge how current ASR technology may be used in conversational computer assisted language learning (CALL) systems. The long-term goal of this research, of which the data collection and experiments are a first step, is to incorporate ASR into computer-based conversational language instruction systems.
[67] W. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. Pronunciation modelling for conversational speech recognition: A status report from WS97. In IEEE Automatic Speech Recognition and Understanding Workshop, 1997.
[ bib | .pdf ]

Accurately modelling pronunciation variability in conversational speech is an important component for automatic speech recognition. We describe some of the projects undertaken in this direction at WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July-August, 1997. We first illustrate a use of hand-labelled phonetic transcriptions of a portion of the Switchboard corpus, in conjunction with statistical techniques, to learn alternatives to canonical pronunciations of words. We then describe the use of these alternate pronunciations in a recognition experiment as well as in the acoustic training of an automatic speech recognition system. Our results show a reduction of word error rate in both cases band 2.2% with acoustic retraining.
[68] W. Byrne. Information geometry and maximum likelihood criteria. In Conference on Information Sciences and Systems, Princeton, NJ, 1996.
[ bib | .pdf ]

This paper presents a brief comparison of two information geometries as they are used to describe the EM algorithm used in maximum likelihood estimation from incomplete data. The Alternating Minimization framework based on the I-Geometry developed by Csiszár is presented first, followed by the em-algorithm of Amari. Following a comparison of these algorithms, a discussion of a variation in likelihood criterion is presented. The EM algorithm is usually formulated so as to improve the marginal likelihood criterion. Closely related algorithms also exist which are intended to maximize different likelihood criteria. The 1-Best criterion, for example, leads to the Viterbi training algorithm used in Hidden Markov Modeling. This criterion has an information geometric description that results from a minor modification of the marginal likelihood formulation.
[69] K. Wang, S. Shamma, and W. Byrne. Noise robustness in the auditory representation of speech signals. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1993.
[ bib ]

[70] W. Byrne. Generalization and maximum likelihood from small data sets. In IEEE-SP Workshop on Neural Networks in Signal Processing, 1993.
[ bib | .pdf ]

An often encountered learning problem is maximum likelihood training of exponential models. When the state is only partially specified by the training data, iterative training algorithms are used to produce a sequence of models that assign increasing likelihood to the training data. Although the performance as measured on the training set continues to improve as the algorithms progress, performance on related data sets may eventually begin to deteriorate. The cause of this behavior can be seen when the training problem is stated in the Alternating Minimization framework. A modified maximum likelihood training criterion is suggested to counter this behavior. It leads to a simple modification of the learning algorithms which relates generalization to learning speed. Training Boltzmann Machines and Hidden Markov Models is discussed under this modified criterion.
[71] W. Byrne, J. Robinson, and S. Shamma. The auditory processing and recognition of speech. In Proceedings of the Speech and Natural Language Workshop, pages 325-331, October 1989.
[ bib ]

[72] W. Byrne, R. Zapp, P. Flynn, and M. Siegel. Adaptive filtering in microwave remote heart monitors. In IEEE Engineering in Medicine and Biology Society, Seventh annual Conference, 1985.
[ bib ]

