Archived Seminars by Year
February 18, 1997
Stanley Chen, Carnegie Mellon University
Abstract: Recent work has demonstrated that maximum entropy models are a promising technique for combining multiple sources of information; applications of maximum entropy models have included language modeling, prepositional phrase attachment, and machine translation. Smoothing is a technique for improving probability estimates in the presence of limited data. Smoothing has yielded substantial performance gains in a variety of applications; however, there has been very little work in developing smoothing techniques for maximum entropy models. In this work, we compare smoothed maximum entropy n-gram models with smoothed conventional n-gram models on the task of language modeling. We show that existing smoothing algorithms for maximum entropy models compare unfavorably to smoothing algorithms for conventional models, and propose a novel method that yields comparable performance to conventional smoothing techniques.
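As background, a minimal sketch of one conventional smoothing technique of the kind the talk compares against, Jelinek-Mercer (interpolated) bigram smoothing; the toy corpus, the fixed weight lam, and the function name are illustrative assumptions, not details from the talk.

```python
from collections import Counter

# Toy corpus; in practice counts come from a large training text.
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_interpolated(w2, w1, lam=0.7):
    """Jelinek-Mercer smoothing: interpolate the maximum-likelihood
    bigram estimate with the unigram estimate so that unseen bigrams
    still receive nonzero probability."""
    p_uni = unigrams[w2] / total
    p_bi = bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(p_interpolated("sat", "cat"))   # seen bigram
print(p_interpolated("mat", "cat"))   # unseen bigram, still > 0
```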
February 25, 1997
Eugene Charniak, Brown University
Abstract: We describe a parsing system based upon a language model for English that is, in turn, based upon assigning probabilities to possible parses for a sentence. This model is used in a parsing system by finding the parse for the sentence with the highest probability. This system outperforms previous schemes. As this is the third in a series of parsers by different authors that are similar enough to invite detailed comparisons but different enough to give rise to different levels of performance, we also report on some experiments designed to identify what aspects of these systems best explain their relative performance.
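As an illustration of the core idea of assigning probabilities to parses, a minimal sketch that scores a parse tree as the product of context-free rule probabilities; the grammar, its probabilities, and the tree encoding are invented for illustration and are not Charniak's actual model.

```python
import math

# Hypothetical rule log-probabilities P(rhs | lhs); invented numbers.
rule_logprob = {
    ("S", ("NP", "VP")): math.log(1.0),
    ("NP", ("the", "dog")): math.log(0.3),
    ("VP", ("barks",)): math.log(0.4),
}

def tree_logprob(tree):
    """Score a parse, encoded as (label, child, child, ...), by summing
    the log-probabilities of the rules it uses; leaves are plain strings."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    lp = rule_logprob[(label, rhs)]
    return lp + sum(tree_logprob(c) for c in children if not isinstance(c, str))

parse = ("S", ("NP", "the", "dog"), ("VP", "barks"))
print(math.exp(tree_logprob(parse)))  # the parser keeps the parse maximizing this
```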
March 4, 1997
David B. Pisoni, Indiana University at Bloomington
Abstract: In this talk I will focus on several general issues surrounding perceptual learning in speech perception. While my major interest centers on the learning of nonnative speech contrasts by mature adults, much of what I have to say is also relevant to other issues dealing with current theoretical accounts of speech perception and perceptual development. Central to my presentation is a concern for the nature of the perceptual changes that take place when the sound system of a language is acquired during development. In particular, we have been interested in what happens to a listener's perceptual abilities when he or she acquires a native language. What happens to a listener's ability to identify and discriminate speech sound contrasts that are not present in the language-learning environment? Are the listener's perceptual abilities permanently lost because the neural mechanisms have atrophied due to lack of sensory stimulation during development, or are they simply realigned and only temporarily modified due to changes in selective attention? It is well known that native speakers of Japanese learning English generally have difficulty discriminating and categorizing the English phonemes /r/ and /l/, even after years of experience. Previous research that attempted to train Japanese listeners to distinguish this contrast using synthetic stimuli showed little success, especially when generalization to natural tokens containing /r/ and /l/ was tested. In this presentation, I describe the major results of an ongoing research program on perceptual learning of speech sounds in non-native speakers of English. In all of our studies, we used a novel training procedure that differed from earlier attempts to modify Japanese listeners' perception of English /r/ and /l/. Japanese subjects were trained in an identification paradigm using multiple natural exemplars contrasting the /r/ and /l/ phonemes in a variety of phonetic environments. A pretest-posttest design combined with tests of generalization containing novel natural tokens was used to assess the effectiveness of this "high-variability" training procedure. Analysis of data from several experiments showed that the new training procedure was much more effective than earlier techniques. Reliable differences were obtained in performance between pretest and posttest perception scores. Moreover, reliable differences were observed in several generalization tests which involved presentation of novel tokens of /r/ and /l/ that the subjects were never trained on. The best generalization performance was observed when subjects received novel words produced by a talker that they had heard during the training phase. Other perceptual learning experiments, carried out at the ATR Labs in Kyoto, Japan, with monolingual subjects, replicated our original findings and assessed the retention and time-course of this learning. Finally, our most recent study examined the transfer of perceptual knowledge to control over production of /r/ and /l/ to assess perceptuo-motor interactions between speech perception and production. The results demonstrate the importance of stimulus variability in learning to perceive and produce novel phonetic contrasts that are not distinctive in a listener's native language.
March 11, 1997
“Statistical speech recognition using a functional model of "hidden" processes in human speech communication”
Li Deng, University of Waterloo
Abstract: In this talk, I will present a general Bayesian statistical framework for constraint-free speech recognition based on a functional model for global characteristics of human speech communication (production and perception). The model consists of a nonlinear (autosegmental-based) phonological component (which determines the structure of the speech recognizer) and a dynamic phonetic-interface component, and contains the conventional HMM-based speech model as a highly simplified and degenerate special case. I will show how the model can be efficiently parameterized, and how the model parameters can be automatically estimated using a very small amount of acoustic speech data. Some evaluation results of the speech recognizer using the TIMIT database will be presented. Finally, I will outline our current work on applying the model to multilingual speech recognition, aiming at cross-language portability (i.e., constructing speech recognizers for a target language using training speech data from only one or two source languages).
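For reference, a minimal numpy sketch of the likelihood computation in the conventional HMM-based model that the abstract describes as a degenerate special case of the proposed framework; the toy parameters are illustrative assumptions.

```python
import numpy as np

def forward_loglik(log_A, log_pi, log_B):
    """Forward algorithm: log P(observations) under a discrete HMM.
    log_A[i, j] -- log transition probability from state i to state j
    log_pi[i]   -- log initial probability of state i
    log_B[t, i] -- log emission probability of frame t in state i
    """
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # log-sum-exp over predecessor states, done manually for clarity
        m = alpha.max()
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_B[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

# Toy 2-state model with 3 observation frames (invented numbers).
A = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
B = np.array([[0.7, 0.1], [0.6, 0.2], [0.1, 0.9]])  # B[t, i] = p(obs_t | state i)
print(forward_loglik(np.log(A), np.log(pi), np.log(B)))
```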
Speaker Biography: Li Deng (S'83-M'86-SM'91) received the B.S. degree in biophysics from the University of Science and Technology of China in 1982, the M.S. degree in electrical engineering from the University of Wisconsin-Madison in 1984, and the Ph.D. degree in electrical engineering from the University of Wisconsin-Madison in 1986. He worked on large-vocabulary automatic speech recognition at INRS-Telecommunications, Montreal, Canada, from 1986 to 1989. Since 1989, he has been with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario, Canada, where he is currently a Full Professor. From 1992 to 1993, he conducted sabbatical research at the Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, working on statistical models of speech production and the related speech recognition algorithms. His research interests include acoustic-phonetic modeling of speech; speech recognition, synthesis, and enhancement; speech production and perception; statistical methods for signal analysis and modeling; nonlinear signal processing; neural network algorithms; computational phonetics and phonology for the world's languages; and auditory speech processing.
March 25, 1997
Jan Hajic, Charles University
Abstract: Following the Prague tradition, dependency-based description of the formal representation of language structures has become the basis for building an annotated Czech corpus, the Prague Dependency Treebank (PDT; size approx. 1 mil. words). The resulting corpus will also become part of the Czech National Corpus, which currently holds about 35 mil. words and will reach 100 mil. words by the end of 1998. During the talk, all three levels of annotation of the PDT will be briefly explained (morphological, analytical, tectogrammatical). Then, the talk will concentrate on the analytical level, which is currently being worked on. The principles of annotation will be presented in detail, together with some interesting phenomena and the rules for their representation and annotation, such as multiword expressions, non-continuous sentence constituents, incomplete sentences, ellipsis, coordination, parenthesis, etc. The software used for automatic preprocessing of the input text, as well as the hand-annotation support software, will be described and demonstrated.
March 31, 1997
Kemal Oflazer, Bilkent University
Abstract: This talk presents a constraint-based approach to morphological disambiguation and tagging in which individual constraints vote on matching morphological parses, or sequences of parses, and disambiguation of all the tokens in a sentence is performed at the very end, by selecting parses or sequences of parses that receive the highest votes. This constraint application paradigm makes the outcome of the disambiguation independent of the rule sequence, and hence relieves the rule developer from worrying about the potentially conflicting rule sequencing found in other systems. We have applied our approach to both Turkish and English. For Turkish, a language with complex agglutinative word structures displaying rather different types of morphological ambiguity not found in languages like English, we have used parse voting; with about 500 constraint rules and some additional simple statistics, we have attained a recall of 95-96% and a precision of 94-95%, with about 1.01 parses per token. We have recently applied path voting to tagging English, where constraints efficiently vote on all possible matching sequences of tags, and have obtained quite similar results. Our current implementations are prototypes, and we outline an efficient implementation technique using finite-state transducers and transducer composition.
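A schematic sketch of the voting idea, simplified to per-token (rather than sequence) voting; the candidate parses, features, weights, and rules below are invented for illustration and stand in for the system's roughly 500 hand-written constraints.

```python
# Each candidate parse of a token is a set of morphological features;
# each constraint votes (with a weight) for the parses it matches.
candidates = {
    "okuma": [{"POS": "Noun", "CASE": "Nom"},    # 'the reading'
              {"POS": "Verb", "NEG": "Neg"}],    # 'do not read'
}

constraints = [
    (2, lambda p: p.get("POS") == "Verb"),   # invented rule: prefer verbal readings
    (1, lambda p: p.get("CASE") == "Nom"),
]

def disambiguate(parses):
    """Tally weighted votes from all matching constraints and keep the
    highest-scoring parse; the result is independent of rule order."""
    votes = [sum(w for w, match in constraints if match(p)) for p in parses]
    return parses[votes.index(max(votes))]

print(disambiguate(candidates["okuma"]))   # the verbal reading wins, 2 votes to 1
```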
April 8, 1997
Frank K. Soong, Bell Labs, Lucent Technologies
Abstract: A highly reverberant acoustic environment can have negative effects on the audio quality of hands-free communication devices such as teleconferencing systems and automatic speech recognizers (ASRs). In this talk, the nature of room reverberation is first characterized. The temporal "shadowing effects" and non-minimum-phase characteristics of a typical impulse response of a large room, which can last from a quarter second to half a second, are analyzed. Due to the non-minimum-phase nature of such a long impulse response, an LPC-based inverse filter can provide only rather limited dereverberation. As a result, significant residual reverberation still exists, and it can deteriorate the audio quality of a teleconference and the recognition performance of an ASR. We further investigate the dereverberation problem using single- and multi-channel approaches. For a single-channel setup, an inverse filter with an optimal delay is chosen to maximally deconvolve a non-minimum-phase impulse response, and up to 10 dB of dereverberation improvement can be achieved. For a multi- (two-) channel setup, a least-squares solution is devised, and very clean dereverberation (~35 dB) can be obtained. We will demonstrate the original reverberant speech along with the various dereverberation algorithms that we tried. However, in the above single- and multi-channel approaches, the requirement that the room impulse response be identified first renders the solution not readily applicable to a real situation. For example, the impulse response from a sound source to a microphone pickup can change instantaneously when the sound source is switched from one speaker to another, or gradually when a speaker moves his/her head, demanding that it be identified continuously. Currently we are investigating blind deconvolution techniques, which use available signals like speech directly, rather than chirp-like artificial probing signals, to identify a room impulse response quickly. Fast adaptation schemes for tracking a changing room impulse response are also under study, and related issues will be discussed in this talk.
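A minimal numpy sketch of the single-channel technique described above: design a least-squares inverse filter whose target is a delayed unit impulse, so that a non-minimum-phase room response can still be approximately deconvolved. The toy impulse response, filter length, and delay sweep are assumptions for illustration.

```python
import numpy as np

def ls_inverse_filter(h, filt_len, delay):
    """Solve min_g || conv(h, g) - delta_delay ||^2 by linear least squares."""
    n_out = len(h) + filt_len - 1
    H = np.zeros((n_out, filt_len))          # convolution matrix: H @ g == conv(h, g)
    for j in range(filt_len):
        H[j:j + len(h), j] = h
    d = np.zeros(n_out)
    d[delay] = 1.0                           # delayed unit impulse target
    g, *_ = np.linalg.lstsq(H, d, rcond=None)
    resid = np.linalg.norm(H @ g - d) ** 2   # residual reverberation energy
    return g, resid

# Toy non-minimum-phase response; sweep the delay and keep the best filter.
h = np.array([0.5, 1.0, -0.3])
g_best, resid_best = min((ls_inverse_filter(h, 32, d) for d in range(34)),
                         key=lambda gr: gr[1])
print(resid_best)
```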
April 15, 1997
Ponani S. Gopalakrishnan, IBM TJ Watson Research Center
Abstract: Research effort in speech recognition is moving towards the transcription of more realistic data sources. At IBM we have been examining acoustic modeling issues in the context of broadcast news transcription. These data exhibit many of the problems we encounter in speech recognition, including a variety of speaking styles, different signal and background conditions, and a variety of topics. In this talk we will highlight some of the issues in automatically transcribing such varied sources. We will discuss some of the acoustic modeling issues we have been examining, including rapid adaptation techniques, and present results on this and other large-vocabulary speech recognition tasks.
April 29, 1997
Michael Kelly, University of Pennsylvania
Abstract: I will examine phonological cues to grammatical class in English, and implicit knowledge of such cues. In particular, I will describe experiments showing that both native and nonnative English speakers can learn such patterns, and that knowledge by nonnative speakers is not correlated with their age of arrival in an English-speaking environment. I will discuss implications of this lack of "sensitive period" effects for our understanding of language acquisition. In the next part of the talk, I will address the question of how much information about grammatical class can be squeezed out of phonology, using connectionist models as an exploratory device. Finally, I will discuss implications of this research for language acquisition, production, and innovation.
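A toy stand-in (not the talk's connectionist models) for probing how much grammatical-class information phonology carries, using a simple classifier; the single stress feature and the invented data encode one well-known English tendency, that disyllabic nouns favor initial stress while disyllabic verbs favor final stress.

```python
from sklearn.linear_model import LogisticRegression

# Invented data for the stress pattern (e.g. REcord the noun vs reCORD the verb).
# Feature vector: [1 if stress falls on the first syllable, else 0]; label: 1 = noun.
X = [[1], [0], [1], [0], [1], [0], [0], [1]]   # last two items are exceptions
y = [1, 0, 1, 0, 1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1]]))   # columns: P(verb), P(noun) given initial stress
```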
September 30, 1997
David D. Lewis, AT&T Labs - Research
Abstract: Random sampling is often used to choose training data for language processing tasks such as text retrieval, email filtering, parsing, and tagging. We propose an alternative approach: label some data, train the system on it, and then select for labeling the examples for which the system is least certain of the correct answer. On a text categorization task this method, which we call uncertainty sampling, reduced by up to 500-fold the amount of training data needed to achieve a given level of categorization accuracy. The computational learning theory results that inspired our own (heuristic) work suggest that, asymptotically, a labeled training set of size *logarithmic* in the amount of unlabeled training data can be used without sacrificing accuracy. For applications where unlabeled training data is cheap, this would be the next best thing to a free lunch. A great deal is unknown about these methods, and we will discuss avenues for research.
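A minimal sketch of the uncertainty sampling loop, assuming a scikit-learn-style classifier; the pool, seed size, batch size, and helper name are illustrative, and this is not Lewis's exact text-categorization setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, y_oracle, n_seed=10, n_rounds=20, batch=5):
    """Grow a labeled set by repeatedly querying the examples whose
    predicted label the current model is least certain about.
    y_oracle plays the role of the human labeler."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), n_seed, replace=False))
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
    clf = LogisticRegression()
    for _ in range(n_rounds):
        clf.fit(X_pool[labeled], y_oracle[labeled])  # seed must cover both classes
        proba = clf.predict_proba(X_pool[unlabeled])
        uncertainty = 1.0 - proba.max(axis=1)        # low top probability = uncertain
        query = np.argsort(uncertainty)[-batch:]     # the `batch` most uncertain
        for q in sorted(query, reverse=True):        # pop high indices first
            labeled.append(unlabeled.pop(q))
    return clf, labeled
```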
Speaker Biography: David D. Lewis is a Principal Research Staff Member at AT&T Labs. Prior to that he was a Member of Technical Staff at AT&T Bell Labs, and a Research Associate at the University of Chicago. He did his Ph.D. research at the University of Massachusetts under Bruce Croft. Lewis' research interests are in the areas of information retrieval, machine learning, and natural language processing.
October 7, 1997
Robert Frank, Department of Cognitive Science, The Johns Hopkins University
Abstract: Sentence processing is almost always an effortless task. This seemingly banal observation becomes considerably more puzzling when juxtaposed with the somewhat less obvious observation that natural language syntax exhibits rampant (local) ambiguity. That is to say, in the processing of a given sentence there are likely to be many points at which there are multiple analyses compatible with the input seen thus far, but only one of these may turn out to be consistent with the remainder of the utterance. Many traditional models of parsing deal with such ambiguity by exploiting some type of parallelism, carrying a number of partial parses forward from the point of ambiguity for further consideration. Unfortunately, such a proliferation of parses is bound to consume significant time and space resources, rendering this type of approach inappropriate as a model for human processing. An alternative approach to the local ambiguity problem has been suggested by Marcus, Hindle, and Fleck (1983) in their work on D-theory. In this work and in a significant number of papers that have followed in this line of inquiry, the parser constructs an underspecified description of a parse tree by positing domination (as opposed to parent) relations among nodes in a phrase structure tree. In this talk, I will suggest that certain empirical and conceptual shortcomings of the D-theory approach to local ambiguity can be overcome if the descriptive primitive is changed from domination to the more abstract and linguistically ubiquitous relation of c-command. I will illustrate the advantages of c-command over domination with a range of examples from English and Japanese.
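A small sketch of the D-theoretic representational move the talk takes as its starting point: the parser asserts domination (an ancestor relation, closed under transitivity) rather than immediate parenthood, leaving the tree underspecified. The class design and node names are invented for illustration.

```python
from itertools import product

class TreeDescription:
    """An underspecified parse description: a set of domination assertions
    (ancestor relations), deliberately weaker than parent relations."""
    def __init__(self):
        self.dom = set()            # (a, b) means: node a dominates node b

    def assert_dominates(self, a, b):
        self.dom.add((a, b))
        # close under transitivity: domination is a partial order
        changed = True
        while changed:
            changed = False
            for (x, y), (u, v) in product(list(self.dom), repeat=2):
                if y == u and (x, v) not in self.dom:
                    self.dom.add((x, v))
                    changed = True

    def must_dominate(self, a, b):
        return (a, b) in self.dom

d = TreeDescription()
d.assert_dominates("S", "VP")       # VP is somewhere under S...
d.assert_dominates("VP", "NP2")     # ...and NP2 somewhere under VP,
print(d.must_dominate("S", "NP2"))  # so S must dominate NP2 -> True,
# without ever committing to immediate parent relations.
```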
October 14, 1997
Fil Alleva, Microsoft Research, Microsoft Corporation
Abstract: Since the earliest days of computing, automatic speech recognition technology has ridden the technology wave that has come to be known as Moore's Law. This is vividly illustrated by the market introduction of several general-purpose continuous speech recognition products. Besides Moore's Law, the two things that have made this possible have been advances in acoustic modeling, especially adaptation technologies, and advances in decoding techniques that permit real-time performance on today's PCs. I will discuss a broad range of these decoding techniques, including beam search, A-star and variants, multi-pass search, incremental application of knowledge, tree organization, heuristic pruning, and look-ahead techniques. Additionally, I will discuss the relative merits of these search techniques with respect to memory size and performance.
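A minimal sketch of one of the techniques mentioned, time-synchronous beam pruning: at each frame, only the best few partial hypotheses are extended, trading search exactness for speed. The scores, state graph, and beam width are illustrative assumptions.

```python
import math

def beam_search(step_scores, transitions, beam_width=3):
    """Time-synchronous search with beam pruning: extend every surviving
    hypothesis, then keep only the beam_width best per frame.
    step_scores[t][s] -- log score of state s at frame t
    transitions[s]    -- states reachable from state s
    """
    hyps = dict(step_scores[0])                      # state -> best log score
    for frame in step_scores[1:]:
        new = {}
        for s, score in hyps.items():
            for nxt in transitions[s]:
                if nxt in frame:
                    cand = score + frame[nxt]
                    if cand > new.get(nxt, -math.inf):
                        new[nxt] = cand
        # prune everything outside the beam before the next frame
        hyps = dict(sorted(new.items(), key=lambda kv: kv[1])[-beam_width:])
    return max(hyps.items(), key=lambda kv: kv[1])   # best final (state, score)

# Toy three-frame example (invented numbers).
scores = [{"a": -1.0, "b": -2.0}, {"a": -1.5, "b": -0.5}, {"b": -0.2}]
trans = {"a": ["a", "b"], "b": ["b"]}
print(beam_search(scores, trans, beam_width=2))      # -> ('b', -1.7)
```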
Speaker Biography: Fil Alleva received the BS degree in mathematics from Carnegie Mellon University in 1980. As an undergraduate he worked on the Harpy speech recognition system. Later, as a Project Scientist at CMU, he contributed to the Agora, Sphinx I, and Sphinx II systems, for which he was awarded the 1992 Allen Newell Research Excellence Medal. He joined Microsoft Research in 1993 and is currently managing CSR system development. Mr. Alleva is a member of the IEEE and has published numerous papers on spoken language technology. His current professional interests are in all areas of spoken language technology, particularly heuristic search and language modeling.
October 21, 1997
Michael Phillips, Applied Language Technologies
Abstract: Applied Language Technologies (ALTech) has been developing and deploying a number of large-scale speech recognition systems for telephone-based transactions and services. The applications include enhanced Yellow Pages for a phone company, a flight reservation system for a major airline, and a stock quote system for an electronic brokerage company. The deployment of these applications presented a number of technical challenges, including barge-in, very large vocabularies, and large numbers of simultaneous callers. In order to make these systems successful, we have also had to solve difficult user-interface problems. In particular, these systems must support first-time and occasional users who need to be guided through the interface, as well as expert users who need to be able to quickly perform the functions they desire. In this talk, I will describe these applications in more detail, and talk about our approach and solutions to these technical and user-interface issues.
Speaker Biography: Michael Phillips is the Vice-President of Engineering and co-founder of Applied Language Technologies (ALTech). Before starting ALTech in 1994, Mr. Phillips was a research scientist in MIT's Spoken Language Systems Group, where he was responsible for many aspects of the development of Summit, MIT's segment-based speech recognition system, including acoustic modeling, lexical access, and integration with natural language constraints. Prior to joining the group at MIT in 1987, he worked on speech recognition at Carnegie-Mellon University and Scott Instruments Corp.
October 29, 1997
Yariv Ephraim, Department of Electrical and Computer Engineering, George Mason University
Abstract: Explicit expressions for the second-order statistics of cepstral components representing clean and noisy signal waveforms are derived. The noise is assumed additive to the signal, and the spectral components of each process are assumed statistically independent complex Gaussian random variables. The key result developed here is an explicit expression for the cross-covariance between the log-spectra of the clean and noisy signals. In the absence of noise, this expression is used to show that the covariance matrix of cepstral components representing a vector of N signal samples approaches a fixed, signal-independent, diagonal matrix at a rate of 1/N^2. In addition, the cross-covariance expression is used to develop an explicit linear minimum mean square error estimator for the clean cepstral components given noisy cepstral components. Recognition results on the ten English digits using the fixed covariance and linear estimator are presented. (Joint work with Dr. Mazin Rahim of AT&T Labs.)
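For reference, the generic linear minimum mean square error estimator that the clean-cepstrum estimator instantiates, with c the clean and y the noisy cepstral vector; the talk's actual contribution, the explicit cross-covariance expression under the Gaussian spectral model, is not reproduced here.

```latex
\hat{c} \;=\; \mu_c \;+\; \Sigma_{cy}\,\Sigma_{yy}^{-1}\,\bigl(y - \mu_y\bigr),
\qquad
\Sigma_{cy} \;=\; \mathrm{E}\!\left[(c - \mu_c)(y - \mu_y)^{\top}\right].
```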
Speaker Biography: Yariv Ephraim received the D.Sc. in Electrical Engineering in 1984 from the Technion-Israel Institute of Technology. He was a Research Scholar at Stanford University from 1984 through 1985, and a Member of Technical Staff at AT&T Bell Laboratories from 1985 until 1993. He has been with George Mason University since 1991, where he is currently an Associate Professor of Electrical and Computer Engineering. His current research interests are statistical signal processing with applications to speech signals and array processing. He was elected Fellow of the Institute of Electrical and Electronics Engineers in 1994.
November 4, 1997
Paul Luce, Department of Psychology, SUNY, Buffalo
Abstract: Probabilistic phonotactics refers to the positional and sequential probabilities of speech sounds within and between spoken syllables and words. I will discuss research examining the role of probabilistic phonotactics in both the perception of isolated spoken words and the detection of words in connected speech. Phonotactic effects reveal a number of interesting properties of the architecture of the system responsible for the perception of spoken language. In particular, effects of probabilistic phonotactics provide insights into the levels of representation and processing involved in spoken word recognition, as well as the role of form-based lexical representations in segmenting words from the speech stream. I will argue that accounting for the role of phonotactics in recognition provides an important evaluation metric for current theories of spoken word recognition.
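A minimal sketch of how positional segment probabilities, one half of probabilistic phonotactics as defined above, can be estimated from a phonemically transcribed word list; the toy lexicon and its pseudo-phoneme symbols are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy phonemic lexicon; real estimates would come from a large dictionary.
lexicon = [("k", "ae", "t"), ("k", "ih", "d"), ("b", "ae", "t"),
           ("s", "ih", "t"), ("k", "ae", "p")]

pos_counts = defaultdict(Counter)
for word in lexicon:
    for i, seg in enumerate(word):
        pos_counts[i][seg] += 1

def positional_prob(seg, pos):
    """P(segment | position), the positional half of probabilistic
    phonotactics; sequential (biphone) probabilities work analogously."""
    total = sum(pos_counts[pos].values())
    return pos_counts[pos][seg] / total

print(positional_prob("k", 0))   # 3/5: 'k' is frequent word-initially here
```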
November 18, 1997
Jerome R. Bellegarda, Apple Technology Group, Apple Computer, Inc.
Abstract: A new framework is proposed to integrate the various constraints, both local and global, that are present in the language. Local constraints are captured via n-gram language modeling, while global constraints are taken into account through the use of latent semantic analysis. An integrative formulation is derived for the combination of these two paradigms, resulting in several families of multi-span language models for large vocabulary speech recognition. Because of the inherent complementarity in the two types of constraints, the performance of the integrated language models, as measured by perplexity, compares favorably with the corresponding n-gram performance.
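A crude numpy sketch of the two ingredients: a latent semantic space obtained by SVD of a word-document matrix, and a plain linear interpolation with an n-gram probability. The interpolation is only a stand-in for the talk's integrative formulation (which is not a simple interpolation), and the matrix, weights, and similarity-to-score mapping are assumptions.

```python
import numpy as np

# Toy word-by-document count matrix; rows = words, columns = documents.
W = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 2.]])

# LSA: a rank-k SVD gives each word a low-dimensional semantic vector.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]            # word representations in LSA space

def semantic_sim(i, j):
    """Cosine similarity between words i and j in the LSA space."""
    a, b = word_vecs[i], word_vecs[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def multi_span_score(p_ngram, sim, lam=0.8):
    """Stand-in combination: blend the local n-gram probability with a
    score (not a normalized probability) from global semantic similarity."""
    p_sem = (sim + 1) / 2               # map cosine in [-1, 1] to [0, 1]
    return lam * p_ngram + (1 - lam) * p_sem

print(multi_span_score(0.05, semantic_sim(0, 2)))
```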
Speaker Biography: Jerome R. Bellegarda received the Diplome d'Ingenieur degree (summa cum laude) from the Ecole Nationale Superieure d'Electricite et de Mecanique, Nancy, France, in 1984, and the M.S. and Ph.D. degrees in Electrical Engineering from the University of Rochester, Rochester, NY, in 1984 and 1987, respectively. In 1987 he was a Research Associate in the Department of Electrical Engineering at the University of Rochester, developing multiple access coding techniques. From 1988 to 1994 he was a Research Staff Member at the IBM T.J. Watson Research Center, Yorktown Heights, NY, working on various improvements to the modeling component of the IBM speech recognition system, and developing advanced feature extraction and recognition modeling algorithms for cursive on-line handwriting. In 1994 he joined Apple Computer, Cupertino, CA, where he is currently Principal Scientist in the Spoken Language Research Group. At Apple he has worked on speaker adaptation, Asian dictation, statistical language modeling, and advanced dialog interactions. His research interests include voice-driven man-machine communications, multiple input/output modalities, and multimedia knowledge management.
December 2, 1997
Mukund Padmanabhan, IBM TJ Watson Research Center