Archived Seminars by Year
February 3, 1998
Allen Gorin, AT&T Labs
Abstract
We are interested in providing automated services via natural spoken dialog systems. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. There are many issues that arise when such systems are targeted for large populations of non-expert users. In this talk, we focus on the task of automatically routing telephone calls based on a user's fluently spoken response to the open-ended prompt of "How may I help you?" We first describe a database generated from 10,000 spoken transactions between customers and human agents. We then describe methods for automatically acquiring language models for both recognition and understanding from such data. Experimental results evaluating call-classification from speech are reported for that database. These methods have been embedded and further evaluated within a spoken dialog system, with subsequent processing for information retrieval and form-filling.
Speaker Biography
Allen Gorin received the B.S. and M.A. degrees in Mathematics from SUNY at Stony Brook in 1975 and 1976 respectively, then the Ph.D. in Mathematics from the CUNY Graduate Center in 1980. From 1980 to 1983 he worked at Lockheed investigating algorithms for target recognition from time-varying imagery. In 1983 he joined AT&T Bell Labs in Whippany where he was the Principal Investigator for AT&T's ASPEN project within the DARPA Strategic Computing Program, investigating parallel architectures and algorithms for pattern recognition. In 1987, he was appointed a Distinguished Member of the Technical Staff. In 1988, he joined the Speech Research Department at Bell Labs in Murray Hill, and is now at AT&T Labs Research in Florham Park. His long-term research interest focuses on machine learning methods for spoken language understanding. He has served as a guest editor for the IEEE Transactions on Speech and Audio, and was a visiting researcher at the ATR Interpreting Telecommunications Research Laboratory in Japan during 1994. He is a member of the Acoustical Society of America and a Senior Member of the IEEE.
February 17, 1998
“Combining Neural Networks and Context-Driven Search for On-Line, Printed Handwriting Recognition in the Newton”
Larry Yaeger, Apple Computer, Inc.
Abstract
While on-line handwriting recognition is an area of long-standing and ongoing research, the recent emergence of portable, pen-based computers has focused urgent attention on usable, practical solutions. I will discuss a combination and improvement of classical methods to produce robust recognition of hand-printed English text, for a recognizer shipping in new models of Apple Computer's Newton MessagePad and eMate. Despite the Newton's ignominious past, this second-generation "Print Recognizer" is widely regarded to have provided the world's first truly usable handwriting recognition system. A straightforward combination of an artificial neural network (ANN), as a character classifier, with a context-driven search over segmentation and word recognition hypotheses provides the basis of this recognition system. Long-standing issues relative to training, generalization, segmentation, models of context, probabilistic formalisms, etc., needed to be resolved, however, to obtain excellent performance. I will give an overview of the entire recognition architecture and present a number of recent innovations in the application of ANNs as character classifiers for word recognition, including integrated multiple representations, normalized output error, negative training, stroke warping, frequency balancing, error emphasis, and quantized weights. User-adaptation and cursive recognition extensions to this technology will be discussed briefly.
Speaker Biography
Larry Yaeger (http://pobox.com/~larryy) is a programmer-scientist who has made technical contributions in the fields of neural networks and handwriting recognition (Newton's "Print Recognizer"), artificial life (PolyWorld), user interface design (for Koko the gorilla), computer graphics (The Last Starfighter, 2010, and Labyrinth), and computational fluid dynamics. His efforts have aided the Newton Systems Group, the Apple Research Labs, and Alan Kay's Vivarium Program at Apple Computer, as well as Digital Productions and various aerospace companies.
February 24, 1998
Stefanie Shattuck-Hufnagel, Massachusetts Institute of Technology
Abstract
The phenomenon of acoustic-phonetic variability in the way words, syllables and sound segments are produced in different contexts in fluent continuous speech is well known, and raises the question of whether this variability is governed by abstract structures at higher levels. Evidence in support of this possibility is found both in phonetic rules based on native language user intuitions, and in measurements of phenomena that are less accessible to intuition, such as preboundary lengthening. Early investigations focussed on traditional morphosyntactic structures, such as lexical words and syntactic clauses and phrases, as candidates for the abstract structures that might govern phonetic variability. In recent decades, however, developments in the theory of prosody have provided a new set of candidate structures in the form of the elements of the prosodic hierarchy. These include constituent structures such as utterances, intonational phrases, prosodic words, etc., as well as prominences such as nuclear and prenuclear pitch accents. Evidence is accumulating that many aspects of phonological and phonetic variation in spoken utterances are systematic with respect to these prosodic structures. In this talk we will explore the hypothesis that traditional morphosyntactic structures influence the phonetic realization of words and sounds indirectly, via their influence on the prosodic structures that directly govern the phonetic choices that speakers make.
March 3, 1998
Yariv Ephraim, George Mason University
Abstract
Robust speech processing poses one of the greatest challenges to the speech research community. Speech recognizers and speech coders are particularly sensitive to channel mismatch. The primary goal of robust speech processing is to compensate for that mismatch. A related goal is to improve perceptual aspects of noisy speech signals for individuals with normal or impaired hearing. The complex nature of speech signals and the large number of adverse conditions make this problem particularly difficult. Important channel mismatches are related to additive noise, as encountered in wireless communications, and to convolutional noise, representing room reverberations. The seminar will review a number of research challenges and describe some recent results.
Speaker Biography
Yariv Ephraim received the D.Sc. in Electrical Engineering in 1984 from the Technion-Israel Institute of Technology. He was a Research Scholar at Stanford University from 1984 through 1985, and a Member of Technical Staff at AT&T Bell Laboratories from 1985 until 1993. He has been with George Mason University since 1991, where he is currently an Associate Professor of Electrical and Computer Engineering. His current research interests are statistical signal processing with applications to speech signals and array processing. He was elected Fellow of the Institute of Electrical and Electronics Engineers in 1994.
March 24, 1998
Louis D. Braida, Massachusetts Institute of Technology
Abstract
Although the intelligibility of the acoustic speech signal is usually very high, in many situations speech reception is improved if cues derived from the visible actions of the talker's face are integrated with cues derived from the acoustic signal. Such integration aids listeners with normal hearing under difficult communication conditions and listeners with hearing impairments under nearly all listening conditions. This talk will describe models of audiovisual integration that have been successful in predicting how well listeners combine visual speech cues with auditory cues. It will also describe how such models can be adapted to predicting the magnitude of the McGurk Effect, the illusory perceptions elicited when the auditory and visual components of speech are mismatched. Finally, the talk will discuss recent research aimed at the development of supplements to speechreading based on the use of automatic speech recognition.
March 31, 1998
Claire Cardie, Cornell University
Abstract
Finding simple, non-recursive, base noun phrases is an important subtask for many natural language processing applications. While previous empirical methods for base NP identification have been rather complex, this talk instead proposes a very simple algorithm that is tailored to the relative simplicity of the task. In particular, the talk will present a corpus-based approach for finding base NPs by matching part-of-speech tag sequences. The training phase of the algorithm is based on two successful techniques: first the base NP grammar is read from a "treebank" corpus (a la Charniak); then the grammar is improved by selecting rules with high "benefit" scores (a la Brill). Using this simple algorithm with a naive heuristic for matching rules, we achieve surprising accuracy in an evaluation on the Penn Treebank Wall Street Journal.
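The two training steps named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the talk: the function names, the toy data, the definition of benefit as correct matches minus incorrect matches on held-out sentences, the threshold of 1, and the greedy longest-match-first heuristic.

```python
from collections import Counter

def extract_rules(corpus):
    """Read the grammar: every POS-tag sequence bracketed as a base NP."""
    rules = set()
    for tags, spans in corpus:
        for start, end in spans:
            rules.add(tuple(tags[start:end]))
    return rules

def match(tags, rules):
    """Bracket a tag sequence left to right, trying the longest rule first
    (an assumed stand-in for the 'naive heuristic' in the abstract)."""
    max_len = max(map(len, rules), default=0)
    found, i = [], 0
    while i < len(tags):
        for n in range(min(max_len, len(tags) - i), 0, -1):
            if tuple(tags[i:i + n]) in rules:
                found.append((i, i + n))
                i += n
                break
        else:
            i += 1  # no rule starts here; move on one tag
    return found

def prune(rules, held_out, threshold=1):
    """Keep rules whose benefit (assumed: correct minus incorrect
    brackets on held-out data) reaches the threshold."""
    benefit = Counter()
    for tags, gold_spans in held_out:
        gold = set(gold_spans)
        for start, end in match(tags, rules):
            rule = tuple(tags[start:end])
            benefit[rule] += 1 if (start, end) in gold else -1
    return {r for r in rules if benefit[r] >= threshold}

# Toy data: (POS tags, base NP spans as half-open [start, end) indices).
train = [
    (["DT", "NN", "VBD", "DT", "JJ", "NN"], [(0, 2), (3, 6)]),
    (["PRP", "VBD", "VBG", "NNS"], [(0, 1), (2, 4)]),
]
held_out = [
    (["DT", "NN", "VBD", "DT", "JJ", "NN"], [(0, 2), (3, 6)]),
    (["PRP", "VBD", "VBG", "NNS"], [(0, 1), (3, 4)]),  # VBG is a verb here
]

grammar = prune(extract_rules(train), held_out)
brackets = match(["DT", "JJ", "NN", "VBD", "PRP"], grammar)
```

In this toy run the rule VBG NNS is read from the training data but dropped by pruning, because its only held-out match is wrong. Rules that never fire on the held-out data are also dropped under the assumed threshold.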
April 7, 1998
Louis Goldstein, Yale University
Abstract
Within the framework of Articulatory Phonology (e.g., Browman & Goldstein, 1992), the primitive units of phonological structure are hypothesized to be articulatory gestures. Gestures are abstract characterizations of vocal tract constriction actions and are formalized using concepts that have proven useful in modeling other kinds of actions: coordinative structures and dynamical systems. Previous work to be summarized in this talk demonstrated how a variety of superficially unrelated phonetic and phonological alternations could result from general principles of variation in gestural patterning: changes in the magnitudes of individual gestures, and changes in temporal overlap of gestures. Recent research (e.g., Byrd, 1996) has shown that pairs of gestures differ considerably in the extent to which they exhibit variability in temporal overlap. It is possible to capture this by explicitly representing the "bonding" strength of the coordination relations (or coordination constraints) between pairs of gestures. This talk will show how certain properties of syllable structure in English fall out of the simultaneous satisfaction of competing coordination constraints. In addition, a self-organization approach to the problem of how patterns of gestural bonding arise will be presented. Application of this approach to phonological development will be discussed. In particular, a number of trends in the (apparent) order of emergence of consonants in infants' early words can be explained by hypothesizing that at first, infants are producing constriction actions with no systematic coordination and that constraints on intergestural coordination among actions develop gradually.
References:
Browman, C. & Goldstein, L. (1992). Articulatory phonology: an overview. Phonetica, 49, 155-180.
Byrd, D. (1996). Influences on articulatory timing in consonant sequences. Journal of Phonetics, 24, 209-244.
April 14, 1998
Raymond J. Mooney, The University of Texas at Austin
Abstract
We are exploring the application of relational learning methods, such as inductive logic programming, to the construction of natural language processing systems. We have developed a system, CHILL, for learning a deterministic parser from a corpus of parsed sentences. CHILL can construct complete natural-language interfaces that translate database queries directly into executable logical form. It has been tested on English queries for a small database on U.S. geography, answering queries more accurately than a previous hand-built system. It has also recently been tested on Spanish, Turkish, and Japanese queries for the same database, and English queries about jobs posted to the newsgroup misc.jobs.offered and queries about restaurants in the California Bay Area. We are also developing a system for inducing pattern-match rules for extracting information from natural-language texts. This system has obtained promising initial results on extracting information from postings to misc.jobs.offered in order to assemble a database of available jobs. Our overall goal is to combine these techniques to automate the development of natural language systems that can answer queries about information available in a body of texts, such as newsgroup postings or web pages.
April 21, 1998
“From "Resource Management" to "Call Home" - A Little Science, a Little Art, and Still a Long Way to Go”
Andrej Ljolje, AT&T Labs - Research
Abstract
Hidden Markov Models (HMMs) were well established in the late eighties during the height of the Resource Management evaluations. They have been so successful that they form the basis of virtually all speech recognition systems today. In the following years, most of the research effort was devoted to speaker adaptation and improving recognizer structure within the HMM framework (phoneme context dependency clustering, pronunciation modeling). Large improvements in performance have also been achieved on very small tasks (digits, spelled letters) using discriminative training to minimize empirical error rate and signal conditioning techniques. Additional small improvements were achieved using segmental duration modeling and explicit modeling of correlations, either across observation parameters or over time. With the advent of tasks such as Switchboard and Call Home where the speech is collected in a more natural setting and where the word error rate was initially twice as high as the word accuracy, it was clear that more needed to be done than just collection of more data. This resulted in widespread use of Vocal Tract Length Normalization (VTLN) and Speaker Adaptive Training (SAT).
Despite all the new acoustic modeling techniques, three observations dominate the current perception of the modeling field: (1) A mismatch between the training speech and test speech (different microphone, spectral filtering, speaking style, noise, echo, etc.) can cause drastic degradation in recognition performance. (2) There is evidence that the speech transcription differences between human transcribers on Named Entities are much closer to automatic speech recognition performance than the differences on function words. Acoustic modeling dominates in the case of Named Entities and language/semantic modeling in the case of function words. (3) Baseline performance using HMMs still determines the final recognition performance for different recognition systems, as additional techniques seem to consistently improve performance across systems, regardless of the baseline performance. The science gave us the improvements with the new modeling techniques; the art still dominates the baseline performance, and we have a long way to go to approach human robustness to environment changes and use of syntactic and semantic knowledge in recognizing speech.
Speaker Biography
Dr. Andrej Ljolje grew up in Croatia. He was awarded a B.Sc. degree in Cybernetics and Control Engineering (with Mathematics) from the University of Reading, England, in 1982. His Ph.D. degree in Speech Processing was awarded by the University of Cambridge, England, in 1986. From 1985 to 1987 he was a Research Fellow at Trinity Hall, Cambridge. He spent a year at AT&T Bell Labs as a Post-doc, where he remained until the trivestiture in 1996 as a Member of Technical Staff. Since then he has been with AT&T Labs as a Principal Technical Staff Member. His work has been primarily in acoustic modeling for tasks ranging from a few words to unlimited vocabularies.
April 28, 1998
Ananth Sankar, SRI International
Abstract
We present a detailed experimental study of Gaussian splitting and merging algorithms to train the parameters of state-clustered hidden Markov model (HMM) automatic speech recognition (ASR) systems. Gaussian splitting uniformly distributes the training data into the model parameters, and gives very different estimates from SRI's previous training algorithm. However, it does not significantly alter recognition performance. Gaussian merging gives robust parameter estimates, which are found to be critical for both speaker-independent and speaker-adaptive recognition. A combination of these techniques, the Gaussian Merging-Splitting (GMS) algorithm, is then used to explore a variety of HMM structures. For a fixed number of Gaussian parameters, it is found that decreasing the number of state clusters while increasing the number of Gaussians per cluster gives better performance using the GMS algorithm. However, to robustly estimate systems with a large number of state clusters, we propose a model where the HMM with a larger number of state clusters is a transformed version of an HMM with a smaller number of clusters. A set of transforms is used for each of the state clusters in the larger system, and the transforms are trained using maximum-likelihood estimation. Experimental results show that this method gives superior performance to the GMS algorithm.
October 6, 1998
Siegfried (Jimmy) Kunzmann, IBM Speech Systems - Germany
October 13, 1998
“Using Eye Movements to Study On-Line Sentence Processing in Children: Finding the Kindergarten-Path Effect”
John C. Trueswell, Dept. of Psychology and Institute for Research in Cognitive Science, University of Pennsylvania
Abstract
I will report on a new method for studying the language processing strategies of children, in which a head-mounted eye tracking system was used to monitor eye movements as children responded to spoken instructions. Systematic differences were found in how children and adults interpret ambiguous phrases. Five-year-old children relied heavily on the linguistic properties of the input, showing less ability or inclination than adults to coordinate these properties with information from the situation or context. The findings suggest that a central component of human development is acquiring the capacity to rapidly coordinate the information generated from multiple perceptual and cognitive systems.
October 20, 1998
Harry Printz, IBM T.J. Watson Research Center
Abstract
Maximum entropy / minimum divergence modeling is a powerful technique for constructing probability models, which has been applied to a wide variety of problems in natural language processing. A maximum entropy / minimum divergence (MEMD) model is built from a base model and a set of feature functions whose empirical expectations on some training corpus are known. A fundamental difficulty with this technique is that while there are typically millions of feature functions that could be incorporated into a given model, in general it is not computationally feasible, or even desirable, to use them all. Thus some means must be devised for determining each feature's predictive power, also known as its gain. Once the gains are known, the features can be ranked according to their utility, and only the most gainful ones retained. This talk presents a new algorithm for computing feature gain that is fast, accurate and memory-efficient.
Speaker Biography
Harry Printz is a Research Staff Member at IBM's Watson Research Center in Yorktown Heights, NY, where he leads the language modeling team. He has previously worked on reconfigurable hardware at the Digital Equipment Corporation Paris Research Laboratory, and on medical computing at Bolt, Beranek and Newman. He holds a PhD in Computer Science from Carnegie Mellon, a BA in Mathematics and Philosophy from Oxford University, where he was a Rhodes Scholar, and a BA and an MA in Physics from Harvard University. His current research interests are in mathematical models of language and speech.
October 27, 1998
Michael Picheny, IBM T.J. Watson Research Center
Abstract
Advances in speech recognition over the last 20 years have been heavily driven by three factors: more data, more computation, and lots of competition. This talk will describe how these factors have driven the error rate on large vocabulary continuous speech down by more than a factor of four over the last several years, using various examples from signal processing, acoustic modelling, and language modelling. The talk will conclude with a demonstration of some technologies that have resulted from this progress, including IBM's latest speech dictation product, ViaVoice 98.
November 3, 1998
Xiaoqin Wang, Biomedical Engineering, JHU School of Medicine
Abstract
As studies in echo-locating bats have taught us how the brain processes sonar signals, our understanding of cortical processing of communication sounds in nonhuman primate models can provide invaluable insights into brain mechanisms underlying perception of speech or speech-like signals. I'll discuss recent work from my laboratory in a vocal primate, the common marmoset, in which we quantitatively analyzed its species-specific vocalizations and systematically studied corresponding cortical responses. Our study revealed that marmoset vocalizations are highly complex but well structured and contain precise information for the recognition of call types as well as caller identity in a finite-dimensional space. The characteristics of marmoset vocalizations suggest that cortical representations of these sounds are unlikely to be based on "call-detectors". In our neurophysiological experiments conducted in awake marmosets, we have studied cortical responses to various types of marmoset vocalizations in populations of single neurons in both the primary and secondary auditory cortex. Quantitative techniques were used to synthesize and alter natural vocalizations in order to test sensitivity of cortical neurons to perturbations along important stimulus dimensions. Our results indicate that cortical representations of spectrally and temporally complex vocalizations are based on distributed, partially overlapping neuronal populations, and that the behavioral relevance of these sounds plays an important role in shaping neural responses in the auditory cortex. Taken together, these findings illustrate the importance of studying cortical functions in an appropriate experimental model in order to understand brain mechanisms underlying perception of species-specific communication sounds.
November 10, 1998
Philip Resnik, Linguistics Department/UMIACS, University of Maryland, College Park
Abstract
Parallel corpora -- collections of text in parallel translation -- play an important role in current work on statistical models of machine translation, cross-language information retrieval, and acquisition of lexical resources for multilingual natural language processing. Unfortunately, parallel corpora may be difficult or expensive to obtain, may be too domain- or genre-specific, or may simply not exist for the language pair of interest. I will discuss two approaches to overcoming the acquisition bottleneck for parallel text. The first part of the talk will describe first steps toward using the World Wide Web as a source for parallel text, presenting a conceptually simple but effective technique for automatically identifying parallel translated documents on the Web. The second part of the talk will discuss the use of the Bible as a parallel corpus, describing the initial phase of a project investigating the use of parallel biblical text as a resource for improving multilingual optical character recognition.
November 17, 1998
Donca Steriade, UCLA Linguistics Department
Abstract
Syllable structure is postulated in an effort to explain in unified fashion three distinct domains of facts:
(a) Syllabic Intuitions: Speakers appear to have reliable knowledge of syllable count and syllable divisions.
(b) Prosodic Peaks: Some recurrent strings of segments attract tone, stress and metrical ictus in a way that suggests the existence of syllabic constituents such as heavy rimes.
(c) Phonotactics: The range of possible segment sequences within words is locally limited in ways that also lend themselves to analysis in terms of syllabic units. The prevalent view on this is that constraints on possible syllables largely determine the phonotactic structure of words: e.g., when a CCC cluster is impossible, that is attributed to the fact that it cannot be parsed into a coda plus onset sequence. Similar is the idea that the composition of consonant clusters is determined by the law that codas license fewer contrasts relative to onsets (cf. Goldsmith 1990 for a general formulation of this view).
A successful hypothesis regarding syllable structure is one that provides representations and constraints consistent with the data in (a)-(c). In the first part of this talk I argue that the study of syllabic intuitions can progress better if we assume that the phonotactics are largely independent of syllable structure: it is not coda, onset or syllable alignment constraints that yield a successful analysis of phonotactic restrictions. Rather, the key to an understanding of segmental phonotactics is a set of syllable-independent conditions that focus on the distribution of perceptual correlates to the features that compose the segments. In the second part, I suggest that when faced with a task of syllable division, speakers rely on a mix of several types of linguistic knowledge, none of which represent knowledge of syllabic organization laws per se. These are: phonotactic knowledge (in particular knowledge of possible word beginnings and ends), phonetic knowledge (knowledge of the coarticulatory effects neighboring segments have on each other), and the uniformity assumption (one segment in the undivided word must correspond to exactly one segment in the syllabically divided output).
December 1, 1998
James R. Sawusch, Department of Psychology, University at Buffalo
Abstract
The perception of speech reflects both a stimulus-driven process and influences from the mental lexicon. Some of the lexical influences, such as phoneme restoration, are well known. The focus of this presentation will be on understanding how form-based properties of the lexicon interact with perception. The influences of the mental lexicon on perception could include lexical status (is the item a word), lexical neighborhood (how many words is the item similar to), phonotactics (how often does the phoneme sequence occur in the language), and phoneme frequency (how often does the phoneme occur in the language). We have explored the role of these sources of information in perception using phoneme identification and lexical decision tasks. All of these factors influence perception. Furthermore, the time course of these influences can be used to understand the nature of the processing operations in auditory word recognition. Based on results that show we can separately manipulate phoneme frequency, lexical neighborhood, and lexical status, an interactive model with effects at different levels or representations will be outlined. In essence, the perception of speech is the result of a continuous interaction between the auditory-to-phonetic coding of the sound and the knowledge of the individual about the words of their language.
December 8, 1998
Steve Young, Department of Engineering, Cambridge University & Entropic Limited, UK