Archived Seminars by Year
February 2, 1999
Hynek Hermansky, Oregon Graduate Institute of Science and Technology
Abstract: A typical large-vocabulary automatic speech recognition (ASR) system consists of three main components: 1) feature extraction, 2) pattern classification, and 3) language modeling. Replacing hardwired prior knowledge in the pattern classification and language modeling modules with knowledge derived from data has turned out to be one of the most significant advances in ASR research of the past two decades. The speech analysis module, however, has so far resisted this data-oriented revolution and is typically built on textbook knowledge of speech production and perception. Our current research aims at extending the data-driven approach to speech analysis. Since speech has been optimized by millennia of human evolution to serve its communicative purpose over an imperfect production-environment-perception channel, it carries imprints of that channel. It would therefore be gratifying, but would come as no surprise, if data-driven analysis yielded solutions consistent with properties of human speech production and perception. In the talk we first describe our efforts to use mutual information, estimated on a relatively large phonetically hand-labeled database of fluent speech, to map out where in the time-frequency plane the information most relevant for phoneme classification lies. We demonstrate that this information is distributed over a significant time interval around the given phoneme. Linear Discriminant Analysis is then used to derive optimized spectral basis functions and filters (replacing the conventional cosines of cepstral analysis and the conventional RASTA and delta filters for deriving dynamic features) for processing the time-frequency plane of the speech signal. The last part of the talk describes our initial efforts to move away from the conventional emphasis on across-spectrum features (such as the cepstrum) and toward frequency-localized classifiers of relatively long (about 1-second) temporal patterns of critical-band spectral energies. Work supported by the Department of Defense and by the National Science Foundation.
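As a rough illustration of the second step described above, the sketch below uses Linear Discriminant Analysis to derive data-driven temporal filters from labeled critical-band energies. It is a minimal sketch under assumed inputs: the arrays `energies` and `labels`, the chosen band, and the 101-frame context window are illustrative placeholders, not details from the talk.

```python
# A minimal sketch, assuming a hypothetical hand-labeled corpus: `energies`
# is a (num_frames, num_bands) array of log critical-band energies and
# `labels` gives one phoneme label per frame.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_temporal_filters(energies, labels, band=5, context=50, n_filters=3):
    """Stack +/-`context` frames of one critical band around each labeled
    frame and let LDA find the temporal patterns that best separate
    phonemes; the leading discriminant directions act as learned filters."""
    feats, targets = [], []
    for t in range(context, len(energies) - context):
        feats.append(energies[t - context:t + context + 1, band])
        targets.append(labels[t])
    lda = LinearDiscriminantAnalysis()
    lda.fit(np.asarray(feats), targets)
    # Each column of scalings_ is a data-driven temporal filter, playing the
    # role the abstract assigns to hand-designed RASTA and delta filters.
    return lda.scalings_[:, :n_filters]
```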
February 9, 1999
Philip R. Cohen, Oregon Graduate Institute of Science and Technology
Abstract: A new generation of systems is emerging in which the user is able to employ natural communication modalities, including speech and pen-based gesture, in addition to the usual graphical user interface technologies. Multimodal systems incorporating pen and voice communication are advantageous for both very small and very large devices, for spatially-oriented applications, and for contexts emphasizing user mobility. These advantages will be illustrated through QuickSet -- a handheld, collaborative, multimodal system that allows continuous speech and pen-based gesturing as input. QuickSet uses a distributed agent architecture, runs on personal computers, and is scalable from wearable to wall-sized systems. Among QuickSet's applications are initializing military simulations, control of virtual reality environments, logistics planning, and medical informatics. The core of QuickSet is a principled method for combining information derived from different modes. We discuss how a set of meaning fragments produced by recognizers for multiple modes can be unified to determine the best joint interpretation. This unification process will be shown to support multimodal discourse and mutual disambiguation of those meaning fragments. Finally, to assess the impact of multimodal interaction, a study will be described in which expert users completed map-based military tasks using both a graphical user interface and QuickSet. In brief, with the multimodal interface, users positioned entities on a map 3-8 times faster than with the graphical user interface. Multimodal interaction was preferred by all users, particularly for its efficiency and for its precision in drawing. To illustrate the QuickSet technology and its applications, a video and demonstration of the system will be given.
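As a rough, hypothetical illustration of the unification step the abstract mentions, the sketch below merges partial meaning fragments from a speech recognizer and a pen-gesture recognizer; the simple dictionary representation and the example fragments are assumptions for illustration, not QuickSet's actual feature structures.

```python
# Hedged sketch: unification of partial meaning fragments from two modes.
def unify(a, b):
    """Recursively merge two meaning fragments; return None on conflict."""
    if a is None or b is None:
        return None
    if not isinstance(a, dict) or not isinstance(b, dict):
        return a if a == b else None      # atomic values must agree
    merged = dict(a)
    for key, value in b.items():
        if key in merged:
            merged[key] = unify(merged[key], value)
            if merged[key] is None:
                return None               # incompatible fragments
        else:
            merged[key] = value
    return merged

speech  = {"act": "create", "object": {"type": "platoon"}}   # "create a platoon"
gesture = {"object": {"location": (34.2, -118.5)}}           # pen point on the map
print(unify(speech, gesture))   # joint interpretation combining both modes
```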
Speaker Biography: Dr. Phil Cohen is currently Professor and Co-director of the Center for Human-Computer Communication at the Oregon Graduate Institute. Prior to joining OGI, he was a senior computer scientist with the Artificial Intelligence Center of SRI International. His research interests include multimodal human-computer interaction, multiagent architectures, dialogue, computational linguistics, theories of communication and collaboration, mobile computing, and collaboration technology. His research is presently supported by DARPA, ONR, NSF, Microsoft, Intel, Boeing, and France Telecom.
February 16, 1999
Michael Miller, Johns Hopkins University
Abstract: We examine image understanding from the classical source-channel point of view of statistical communications. The space of images corresponding to the source is a Grenander deformable template, an orbit under the group action of diffeomorphisms of a prototype. The prior distribution on the source is induced through a distribution on the group. The channel corresponding to the remote sensor generates the observable images, reflecting projection and noise, and is statistically modeled via a conditional probability density, the likelihood function. Minimum-risk estimation, rate-distortion, and compression are examined by introducing a distance between images via a distance on the group. Three examples are examined, for both finite- and infinite-dimensional groups, associated with geometric and signature variation in image understanding (1,2) and anatomical shape representation (3). 1. U. Grenander, M. I. Miller and A. Srivastava, "Hilbert-Schmidt Lower Bounds for Estimators on Matrix Lie Groups for ATR," IEEE Trans. on Pattern Analysis and Machine Intelligence, November 1998. 2. E. Shusterman, M. I. Miller and B. Rimoldi, "Rate-Distortion Theoretic Design of Dictionaries for Object Recognition," Research Monograph, Center for Imaging Sciences, 1997. 3. U. Grenander and M. I. Miller, "Computational Anatomy: An Emerging Discipline," Quarterly of Applied Mathematics, pp. 617-694, 1998. (*) This work was supported by Grant ARO DAAH-04-95-1-0494, ONR-MURI N00014-98-1-0606.
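To fix ideas for the estimation problem sketched in the abstract, the display below states a generic Bayesian source-channel formulation; the notation (I^D for the sensed image, I_0 for the prototype, g for a group element, pi for the prior on the group, rho for a distance on the group) is an assumption introduced here, not taken from the cited papers.

```latex
% Posterior induced by the prior on the group and the sensor likelihood:
p(g \mid I^{D}) \;\propto\; p\big(I^{D} \mid g \cdot I_{0}\big)\,\pi(g),
\qquad g \in G .
% Minimum-risk estimation under a distance \rho defined on the group:
\hat{g} \;=\; \arg\min_{g' \in G} \int_{G} \rho(g, g')^{2}\, p(g \mid I^{D})\, dg .
```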
March 23, 1999
Douglas W. Oard, University of Maryland
Abstract: It is becoming increasingly easy to acquire and maintain large audio collections, and audio search techniques that incorporate speech recognition technology are evolving rapidly. Less is known, however, about the strategies that users will adopt when searching large audio collections. Information retrieval is a synergistic process in which the user and the system seek to exploit each other's strengths and cover each other's weaknesses, and audio retrieval shifts this balance in several ways when compared with text retrieval. In this talk I will describe some important differences between retrieval of speech and written text and explore the design space in which a new balance might be found. I will then describe the VoiceGraph project at the University of Maryland in which we are using iterative prototyping to design audio retrieval interfaces. I'll conclude with a few remarks on how what we learn in the VoiceGraph project might inform future work on component technologies such as speech recognition, speaker identification, and topic boundary detection.
Speaker Biography: Douglas Oard is an Assistant Professor in the College of Library and Information Services at the University of Maryland. His research interests center around the use of emerging technologies to support information seeking by end users, with present projects investigating audio retrieval, cross-language text retrieval, and the exchange of ratings by networked users. Additional information is available at http://www.glue.umd.edu/~oard/.
March 30, 1999
Pamela Abshire, Johns Hopkins University
Abstract: We seek to gain a better understanding of sensory information processing in physical systems, both natural and engineered. For an information processing system, the statistics of the input, the details of the algorithm, and the task requirements determine the minimum information transmission rate, R (bits/sec). The properties of the physical substrate, such as bandwidth, noise and constraints on the signal value, determine the channel capacity, C (bits/sec). These physical properties must be such that C > R for reliable performance. Furthermore, any implementation involves tradeoffs among costs such as power, speed, accuracy, and area. We seek a better understanding of the compromise between performance and cost. To date there exist several measurements of the information transmission rates of spiking and non-spiking neurons. These measurements must be supported by the physical properties of the communication channel, and we explore this relationship for one such system, the blowfly retina. We construct a communication channel model that incorporates all physical transformations from photons at the photoreceptor to the membrane voltage of the large monopolar cell in the lamina. In this talk I will begin with a brief historical review of previous work that employs information theoretic ideas to analyze neural information processing. I will then describe the components of the early vision system in the blowfly retina. From biophysical data available in the literature, we determine bandwidth limitations and noise contributions at the different stages and calculate the Shannon capacity of the system. We compare our model with empirical information capacities derived from measurements on the system. I will conclude my talk by briefly discussing future work aimed at determining the energy efficiency of the system as given by bit-energy, the ratio of information rate to power dissipated. References: P. Abshire and A.G. Andreou, "Relating Information Capacity to a Biophysical Model of the Blowfly Retina," Electrical and Computer Engineering, Technical Report 13-1998. A.G. Andreou and P.M. Furth, "An Information Theoretic Framework for Comparing the Bit-Energy of Signal Representations at the Circuit Level," Chapter 17, Low-Voltage/Low-Power Integrated Circuits and Systems, edited by Edgar Sanchez-Sinencio and Andreas G. Andreou, IEEE Press, 1998. Work supported by a DARPA/ONR Multidisciplinary University Research Initiative with Boston University on Automated Sensing and Vision Systems, N00014-95-1-0409.
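As a back-of-the-envelope illustration of the kind of capacity calculation the abstract refers to, the sketch below computes the Shannon capacity of a band-limited Gaussian channel from signal and noise power spectral densities; the example spectra are invented placeholders, not the measured photoreceptor data.

```python
# Hedged sketch: C = integral of log2(1 + S(f)/N(f)) df, in bits per second.
import numpy as np

def shannon_capacity(freqs, signal_psd, noise_psd):
    """Capacity of a Gaussian channel given signal and noise spectra."""
    snr = signal_psd / noise_psd
    return np.trapz(np.log2(1.0 + snr), freqs)

freqs = np.linspace(1.0, 500.0, 1000)          # Hz
signal = 1.0 / (1.0 + (freqs / 100.0) ** 2)    # illustrative low-pass signal spectrum
noise = np.full_like(freqs, 1e-3)              # illustrative flat noise floor
print(f"capacity ~ {shannon_capacity(freqs, signal, noise):.0f} bits/s")
```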
April 6, 1999
Jim Reeds, AT&T Research Labs
Abstract: The late 15th century Voynich manuscript is written in an unknown cipher script that has resisted all attempts at reading for 80 years. But maybe it is really not a cipher at all but a hoax, a madman's scribbles, a text written in an invented language, or a written equivalent of glossolalia. The Book of Soyga, studied by John Dee (1527-1608), is a 16th century magic treatise with a hidden mathematical surprise. And Book 3 of the Steganographia (c. 1500) by Johannes Trithemius (1462-1516) contains recently discovered hidden cipher messages. Jim Reeds will tell the tale of his adventures with these puzzles.
Speaker Biography: Jim Reeds earned a BA in 1969 from the University of Michigan and an MA in 1972 from Brandeis, both in mathematics, and a PhD in statistics from Harvard in 1976. From 1977 to 1982 he taught statistics at UC Berkeley, and from 1983 to the present he has worked in the mathematics research center at Bell Labs and (since 1996) its successor, AT&T Labs Research. He has been interested in cryptanalysis since about 1955, in the Voynich manuscript since 1967, and, in a vague way, in Trithemius since 1973, but his day job concerns cell phone privacy and authentication.
April 13, 1999
Graeme Hirst and Philip Edmonds, University of Toronto
Abstract: Plesionyms, or near-synonyms, are words that, within or across languages, are almost synonyms -- but not quite. Some examples: "forest", "woods", German "Wald"; "fib", "lie", "misrepresentation". Near-synonyms may differ in one or more of the following: connotation, emphasis on subcomponents, implicature, denotation, speaker's expressed attitude, register, and structural or selectional requirements. In all but the last two of these, the distinction between two near-synonyms is at least in part conceptual. It is necessary to represent lexical meaning finely enough that distinctions between near-synonyms can adequately be taken into account in such tasks as lexical choice in machine translation and mono- and multilingual text generation. This is the basis for an alternative to conventional models of the relationship between words and concepts: a coarse-grained hierarchy in which clusters of near-synonyms are distinguished by explicit differentiae. This model is implemented in a system for lexical choice that is envisioned as a component of high-quality machine translation.
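A minimal sketch of the clustered model described above, assuming invented differentiae and weights purely for illustration: each cluster carries one coarse concept, distinguishes its near-synonym members by explicit features, and lexical choice picks the member whose features best match the intended nuances.

```python
# Hedged sketch: one near-synonym cluster with explicit differentiae.
UNTRUTH_CLUSTER = {
    "concept": "generic-untruth",
    "members": {
        "lie":               {"deliberate": 1.0, "severity": 0.8, "formality": 0.5},
        "fib":               {"deliberate": 1.0, "severity": 0.2, "formality": 0.2},
        "misrepresentation": {"deliberate": 0.6, "severity": 0.6, "formality": 0.9},
    },
}

def choose_word(cluster, preferences):
    """Lexical choice: return the member whose differentiae sit closest
    to the speaker's intended nuances (smallest squared distance)."""
    def distance(feats):
        return sum((feats.get(k, 0.5) - v) ** 2 for k, v in preferences.items())
    return min(cluster["members"], key=lambda w: distance(cluster["members"][w]))

print(choose_word(UNTRUTH_CLUSTER, {"severity": 0.1, "formality": 0.2}))  # -> "fib"
```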
April 18, 1999
April 19, 1999
April 20, 1999
Jim Flanagan, Rutgers University
Abstract: Mass deployment of computing technology depends upon ease of use and natural human/machine interfaces. Communication methods based upon the sensory modes preferred by the human -- sight, sound and touch -- are consequently being developed. While as yet primitive, multimodal interfaces depend centrally upon conversational interaction, and integration of speech with visual and tactile capabilities centers on fusing simultaneous sensory inputs (which are often redundant, ambiguous or contradictory) to achieve reliable interpretation and action. This report describes one in-progress research effort in multimodal interfaces that employs simultaneous eye tracking, visual gesture, hands-free sound capture, speech recognition, text-to-speech synthesis, tactile force-feedback and manual gesture.
Speaker Biography: James Flanagan is Vice President for Research at Rutgers University. He is also Board of Governors Professor in Electrical and Computer Engineering. Rutgers is the State University of New Jersey, with an enrollment of 48,000 and a faculty and staff of 8,000. Flanagan joined Rutgers in 1990 after extended service in research and research management at AT&T Bell Laboratories. He was previously Director of Information Principles Research, with responsibilities in digital communications and information systems. Flanagan holds the S.M. and Sc.D. degrees in Electrical Engineering from the Massachusetts Institute of Technology. He has specialized in voice communications, computer techniques and electroacoustic systems, and has authored approximately 200 papers, 2 books, and 50 patents in these fields. Flanagan is a Fellow of the IEEE, the Acoustical Society of America, and the American Academy of Arts and Sciences. He has received a number of technical awards, and is a member of the National Academy of Engineering and the National Academy of Sciences. In 1996 he received the National Medal of Science at the White House.
August 4, 1999
Kenneth Ward Church, Department Head, AT&T Labs-Research
Abstract: Repetition is very common. Adaptive language models were introduced to account for the fact that words (and their variant forms) tend to appear in bursts. We will show that this is especially true for words with a lot of content, such as proper nouns, technical terminology and good keywords for information retrieval. A proper noun like "Kennedy" is more likely to be repeated in a Brown Corpus document than a common word like "showed," even though both words are about equally frequent. We find that words (and n-grams) with more content tend to be more bursty than words (and n-grams) with less content, all other things being equal. Measures borrowed from Information Retrieval, term frequency and document frequency, will be used to predict both the average frequency and the variance (burstiness) of a word. The literature on adaptive language models has studied the first moment in considerable detail, but has tended to ignore the second moment.
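The following sketch illustrates the two borrowed measures and the burstiness they predict, using an invented three-document toy corpus rather than the Brown Corpus: term frequency (tf) counts all occurrences, document frequency (df) counts the documents touched, and tf/df gives the average number of repetitions per document that mentions the word at all.

```python
# Hedged sketch: tf, df, and a simple burstiness measure on a toy corpus.
from collections import Counter

docs = [
    "kennedy spoke and kennedy answered questions about kennedy",
    "the study showed results",
    "another study showed different results",
]

tf, df = Counter(), Counter()
for doc in docs:
    words = doc.split()
    tf.update(words)        # every occurrence counts
    df.update(set(words))   # each document counts at most once

for word in ("kennedy", "showed"):
    burstiness = tf[word] / df[word]   # mean repetitions per touched document
    print(word, tf[word], df[word], round(burstiness, 2))
```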
August 11, 1999
October 5, 1999
Geoffrey Zweig, T.J. Watson Research Center, IBM
Abstract: This talk will describe Bayesian networks and place them in the context of automatic speech recognition. The Bayesian network formalism has both representational and algorithmic components, and the talk will touch on each. Representationally, the networks provide a graphical way of factoring a joint probability distribution. The nodes in a Bayes net graph represent random variables, whose values can be either known or unknown. The arcs in the graph factor the joint distribution into a product of localized conditional probabilities, each of which involves only a few variables. The conditional probabilities can be represented with tables, Gaussians, or any other convenient function. Algorithmically, there are elegant and efficient procedures for computing marginal distributions over the values of the hidden variables, and for finding the likeliest assignment of values. In ASR, these lead directly to the computation of state-occupancy probabilities and Viterbi decodings. The key feature of Bayesian networks is that the algorithms are parameterized on the graph structure and the representation of conditional probabilities. This makes it very easy to explore a variety of probabilistic models with a minimum of code-writing. In addition to describing the basic algorithms, the talk will relate Bayesian networks to the HMMs currently in use in ASR, and show how they provide a simple method for extending HMMs to model phenomena such as rate of speech and articulatory motion.
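As a minimal illustration of the representational and algorithmic points above, the sketch below encodes a three-variable network as local conditional probability tables and computes a marginal by enumerating the hidden variable; the variables and numbers are invented, not from the talk.

```python
# Hedged sketch: a tiny Bayes net as local CPTs, with marginal inference
# by enumeration over the hidden variable.
# P(rate), P(state | rate), P(obs | state): each node conditions only on its parents.
p_rate  = {"slow": 0.4, "fast": 0.6}
p_state = {("slow",): {"s1": 0.7, "s2": 0.3}, ("fast",): {"s1": 0.2, "s2": 0.8}}
p_obs   = {("s1",): {"o": 0.9}, ("s2",): {"o": 0.1}}

def joint(rate, state):
    """Product of the local conditionals: P(rate, state, obs='o')."""
    return p_rate[rate] * p_state[(rate,)][state] * p_obs[(state,)]["o"]

# Marginal P(rate | obs='o') by summing out the hidden state variable.
unnorm = {r: sum(joint(r, s) for s in ("s1", "s2")) for r in p_rate}
z = sum(unnorm.values())
print({r: round(v / z, 3) for r, v in unnorm.items()})
```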
October 7, 1999
Daniel Marcu, Information Sciences Institute and Department of Computer Science, University of Southern California
Abstract: Researchers of natural language have repeatedly acknowledged that coherent texts are not just simple sequences of sentences. Rather, they are complex artifacts whose semantic units are connected by rhetorical, logical, argumentative, and cohesive relations. I present research in theoretical, empirical, and applied computational linguistics that aims at uncovering the constraints that characterize the abstract structure of well-formed texts, and at producing algorithms for the automatic derivation of these structures. I show how automatically constructed discourse structures are exploited in a text summarization system and discuss other open text-processing problems that can be properly addressed in a discourse-based framework.
October 19, 1999
“Advances and challenges in speech, audio and acoustics processing for multimedia and human-machine communications”
B.H. Juang, Acoustics & Speech Research Department at Bell Labs, Lucent Technologies
Abstract: Telecommunication has in recent years been undergoing a revolution. The advent of the Internet and its exponentially rapid growth have triggered entirely new thinking about the means of achieving communication. The old paradigm of telephony has been shifted, or broadened, to video conferencing, distance learning, and remote collaboration and access to multimedia databases, offering both flexibility and richness in media content, management and human-machine interfaces. While significant contributions to this bright new world of telecommunication come from infrastructure technologies such as switches, routers and optical networks, equally important, if not more so, are the advances in the various signal processing areas that drive the new paradigm. These include high-quality acoustics, audio processing & distribution, speech recognition, biometric authentication and signal synthesis. Systems that integrate these technologies will further bring us mobility, convenience and functionality that will drive the need for communication bandwidth and quality in the future. In this talk, I'll summarize advances made during the last few years in these signal processing areas and highlight the challenges ahead of us that need to be overcome to realize the ultimate dream of a multimedia era with natural human-machine interfaces.
Speaker Biography: Dr. Biing-Hwang Juang is Head of the Acoustics & Speech Research Department at Bell Labs, Lucent Technologies. He is engaged in a wide range of communication-related research activities, from speech coding and speech recognition to multimedia communications. He has published extensively and holds a number of patents in the area of speech communication and communication services. He is co-author, with Larry Rabiner, of the book "Fundamentals of Speech Recognition," published by Prentice-Hall. He received the 1993 Best Senior Paper Award, the 1994 Best Senior Paper Award, and the 1994 Best Signal Processing Magazine Paper Award, and was co-author of a paper granted the 1994 Best Junior Paper Award, all from the IEEE Signal Processing Society. In 1997, he won the Bell Labs President's Award for leading the Bell Labs Automatic Speech Recognition (BLASR) team. He also received the prestigious 1998 Signal Processing Society Technical Achievement Award and was named the Society's 1999 Distinguished Lecturer. He is a Fellow of the IEEE.
November 2, 1999
Dr. Andrew Kehler, Artificial Intelligence Center at SRI International
Abstract: The Information Extraction (IE) task has driven a substantial body of natural language processing research during the past decade. We begin by presenting an overview of IE, including a description of the class of problems IE addresses, the manner in which IE systems are generally evaluated, and a typical IE system architecture. After a brief overview of previous work in applying machine learning techniques to IE problems, we describe some ambitious (and largely unsuccessful) attempts to learn discourse interpretation strategies within such a system. The results of this work led to a re-examination of the evaluation metrics used to drive the learning process, which has revealed some unforeseen attributes that should be avoided in future evaluation schemes.
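For readers unfamiliar with how IE systems are generally evaluated, the sketch below scores a system's filled template slots against an answer key using precision, recall, and F-measure; the tiny key and response are invented examples, and the specific metrics the talk re-examines may differ in detail.

```python
# Hedged sketch: slot-level precision, recall, and F-measure for IE output.
def score(key_slots, response_slots):
    key, resp = set(key_slots), set(response_slots)
    correct = len(key & resp)
    precision = correct / len(resp) if resp else 0.0
    recall = correct / len(key) if key else 0.0
    f = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f

key      = {("acquirer", "ACME"), ("acquired", "Widgets Inc."), ("amount", "$2M")}
response = {("acquirer", "ACME"), ("amount", "$3M")}
print(score(key, response))   # -> (0.5, 0.333..., 0.4)
```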
Speaker Biography: Andrew Kehler received his Ph.D. in Computer Science from Harvard University in 1995, and is currently Senior Computer Scientist in the Artificial Intelligence Center at SRI International. His computational linguistics research has focused primarily on applying machine learning techniques to discourse interpretation problems in naturally-occurring data. His linguistic research has also centered on discourse processing, addressing problems in ellipsis, reference, and coherence resolution. In March 2000, he will join the linguistics faculty at the University of California, San Diego.
November 9, 1999
Fernando J. Pineda, Applied Physics Laboratory at Johns Hopkins University
Abstract: Bayesian belief networks (BBNs) are a class of joint distribution functions that have a graphical representation. Inference with BBNs is performed by propagating probabilities throughout the graph via repeated application of Bayes' rule. Both inference and approximate inference in general BBNs are known to be NP-hard, and thus BBNs cannot be applied to large-scale systems without the development of efficient approximate algorithms. The similarity between certain parameterized BBNs (e.g. sigmoid belief networks) and complex physical systems (e.g. the Ising model of ferromagnetism) recently prompted the successful application of variational methods to the problem of approximate inference with parameterized BBNs. After a brief overview, this talk will focus on two novel approximate inference algorithms for a class of parameterized belief networks. The algorithms are a consequence of applying saddle-point methods from statistical physics. The first algorithm yields a previously unknown and easy-to-calculate upper bound on posterior probabilities. The second algorithm is a Gaussian approximation that is significantly more precise than upper- and lower-bound techniques and takes into account correlations between random variables. If time permits, the proposed application of these algorithms to the problem of rapid microorganism identification via database search will be discussed.
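For reference, a generic parameterization of the sigmoid belief networks mentioned above is shown below; the notation is an assumption introduced here, not taken from the talk. Exact posteriors require summing the factored joint over every configuration of the hidden units, a sum that grows exponentially with network size, which is the computation the saddle-point and Gaussian approximations are designed to sidestep.

```latex
% Each binary unit s_i turns on with a probability set by its parents pa(s_i):
P\big(s_i = 1 \mid \mathrm{pa}(s_i)\big)
  = \sigma\Big(\sum_{j \in \mathrm{pa}(s_i)} w_{ij}\, s_j + b_i\Big),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}} .
```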
Speaker Biography: Dr. Fernando Pineda is a member of the principal professional staff at the Johns Hopkins Applied Physics Laboratory, a part-time lecturer in the JHU Department of Computer Science, and a collaborator on various research projects in the JHU Department of Electrical and Computer Engineering. He has served on the editorial boards of Neural Computation, IEEE Transactions on Neural Networks, Neural Networks, Applied Intelligence, and the APL Technical Digest. He has interests in physics, machine learning, neural networks, bioinformatics and analog VLSI.
November 16, 1999
John G. Harris, University of Florida, UF Analog Computation Group
Abstract: The ratio spectrum is a novel spectral representation that has shown promise for improved speech compression, coding, feature extraction and recognition. In effect, the ratio spectrum combines the standard front-end filter bank with a feature extraction process to produce a model that requires dramatically less hardware (or software). Alternatively, the model can be viewed as a small set of constant-Q filters whose center frequencies adapt to locations of high signal energy. The resulting feature vectors are shown to outperform several competing techniques for phoneme recognition (e.g. LPC and cepstrum). We also have implemented speech and audio coding using the ratio spectrum and standard spectrum inversion techniques. Finally, results from fabricated CMOS analog VLSI circuits illustrate a hardware efficient method to sample the ratio spectrum, paving the way for ultra low-power front-end speech processing and feature extraction.
Speaker Biography: Dr. John G. Harris earned his BS and MS degrees in Electrical Engineering from MIT in 1983 and 1986, where he studied massively parallel vision algorithms. He then worked for one year at the Hughes Research Labs in Malibu, CA, implementing perception algorithms for the DARPA Autonomous Land Vehicle. In 1987, Dr. Harris joined the interdisciplinary Computation and Neural Systems Program at Caltech. He earned his PhD in 1991 from Caltech, developing novel silicon vision systems. After a two-year postdoc at the MIT Artificial Intelligence Lab, Dr. Harris joined the faculty of the University of Florida in 1993, where he is currently an Associate Professor in Electrical and Computer Engineering. Dr. Harris leads the UF Analog Computation Group in researching biologically-inspired signal processing and analog VLSI sensory processing. He is the recipient of an NSF CAREER Award as well as a UF Teaching Improvement Program award.
November 23, 1999
Bertrand Delgutte, MIT, Eaton-Peabody Laboratory, Mass. Eye and Ear Infirmary
November 30, 1999
Stephen Grossberg, Department of Cognitive and Neural Systems, Boston University
Abstract: What is the neural representation of a speech code as it evolves in time? How do listeners integrate temporally distributed phonemic information into coherent representations of syllables and words? How does the brain extract invariant properties of variable-rate speech? This talk will describe an emerging neural model that suggests answers to these questions, while quantitatively simulating challenging data about speech and word recognition. In this model, rate-dependent category boundaries emerge from feedback interactions between a working memory for short-term storage of phonetic items and a list categorization network for grouping sequences of items. The conscious speech and word recognition code is suggested to be a resonant wave. Such a wave emerges when sequential activation and storage of phonemic items in working memory provides bottom-up input to unitized representations, or list chunks, that group together sequences of items of variable length. The list chunks compete with each other as they dynamically integrate this bottom-up information. The winning groupings feed back to provide top-down support to their phonemic items. These top-down expectations amplify and focus attention on consistent working memory items, while suppressing inconsistent working memory items. Feedback establishes a resonance which temporarily boosts the activation levels of selected items and chunks, thereby creating an emergent conscious percept. Because the resonance evolves more slowly than working memory activation, it can be influenced by information presented after relatively long intervening silence intervals. Variations in the durations of speech sounds and silent pauses can thereby produce different perceived groupings of words, and future sounds can influence how we hear past sounds. Preprocessing of acoustic signals into parallel auditory streams that respond preferentially to transient and sustained properties of the acoustic signal before being stored in parallel working memories, together with cross-stream automatic gain control, can help to explain how an invariant speech representation can emerge from variable-rate speech. References: Boardman, I., Grossberg, S., Myers, C., and Cohen, M. (1998). Neural dynamics of perceptual order and context effects for variable-rate speech syllables. Perception & Psychophysics, in press. Grossberg, S., Boardman, I., and Cohen, M. (1997). Neural dynamics of variable-rate speech categorization. J. Exptal. Psychol.: Human Percept. & Perform., 23, 481-503. Grossberg, S. and Myers, C. (1999). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review, in press.
December 7, 1999
K.G. Munhall, Department of Psychology, Queen's University, Canada