When confronted with the daunting task of transmitting speech information to deaf individuals, one quickly comes to the conclusion that the solution to this problem requires a full-blown theory of speech perception. Because the bandwidth and dynamic range of speech far exceed the capacity of the deaf ear, radical recoding of important speech information and sensory substitution schemes have been proposed. Within this framework, at least four major questions must be addressed: 1) What are the essential elements of the signal that must be transmitted? 2) What is the information capacity of the receiving sensory system? 3) Does the information capacity of the receiving system match (or exceed) the demands of the signal(s) being transmitted, and if it doesn’t, how should the signal information be recoded to be better matched to the receiving system’s capabilities? 4) What methods will be used to evaluate the success (or failure) of the enterprise? The advantage of dissecting the problem into these four crucial questions is that one can develop a systematic approach to understanding speech recognition that applies equally to sensory substitution such as tactile speech aids, advanced bionics such as cochlear implants, or hearing aids. For this talk, I will present several examples of bimodal and unimodal speech recognition where high levels of intelligibility are achieved with minimal auditory information or by incorporating visual speech information gleaned from lipreading (i.e., speechreading). In the bimodal examples, the amount of transmitted auditory speech information is insufficient to support word or sentence intelligibility (zero percent correct), and the average speechreading performance, even for the very best speechreader (who is usually a deaf individual), might be 10-30% word or sentence intelligibility.
Similar findings have been shown for auditory-only speech inputs composed of disjoint, non-overlapping spectral bands where over 90% of the spectral information has been discarded. The very fact that high levels of speech intelligibility (>80%) can be achieved with multimodal inputs, where the auditory and visual modalities individually fail to transmit enough information to support speech perception, and with unimodal inputs composed of combinations of spectral bands, where individual bands provide minimal acoustic information, may suggest novel approaches to automatic speech recognition.
Traditional automatic speech recognition (ASR) systems comprise an acoustic model (AM), a pronunciation model (PM) and a language model (LM), all of which are independently trained on different datasets and often manually designed. Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly as a single system. While these end-to-end models have shown promising results in the literature, it is not yet clear whether such approaches can improve on current state-of-the-art conventional systems. In this talk, I will discuss various algorithmic and systematic improvements we have explored in developing a new end-to-end model that surpasses the performance of a conventional production system. I will also discuss promising results with multi-lingual and multi-dialect end-to-end models. Finally, I will discuss current challenges with these models and future research directions.
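As a rough illustration of the conventional pipeline described above, the sketch below combines three separately specified score tables (acoustic, pronunciation, language) to pick a word. It is a toy under stated assumptions, not the system discussed in the talk: every table, probability, and function name here is hypothetical, and the best single pronunciation is used instead of summing over all pronunciations.

```python
import math

# All scores below are made-up illustrative values, not results from the talk.

# Acoustic model (AM): log P(audio | phone sequence) for one fixed utterance.
ACOUSTIC_LOGP = {
    ("r", "eh", "d"): math.log(0.6),
    ("r", "iy", "d"): math.log(0.3),
}

# Pronunciation model (PM): log P(phone sequence | word).
PRONUNCIATION_LOGP = {
    "red": {("r", "eh", "d"): math.log(1.0)},
    "read": {("r", "eh", "d"): math.log(0.5), ("r", "iy", "d"): math.log(0.5)},
    "reed": {("r", "iy", "d"): math.log(1.0)},
}

# Language model (LM): log P(word) in some assumed context.
LANGUAGE_LOGP = {
    "red": math.log(0.2),
    "read": math.log(0.7),
    "reed": math.log(0.1),
}


def score(word):
    """Return the best combined log score (AM + PM + LM) for a word."""
    best = -math.inf
    for phones, pm_logp in PRONUNCIATION_LOGP[word].items():
        am_logp = ACOUSTIC_LOGP.get(phones)
        if am_logp is None:
            continue  # the acoustic table has no entry for this pronunciation
        best = max(best, am_logp + pm_logp + LANGUAGE_LOGP[word])
    return best


def decode():
    """Pick the word whose combined score is highest."""
    return max(LANGUAGE_LOGP, key=score)
```

Because each table is built on its own, a change to one (for example a new pronunciation entry) does not adjust the others; an end-to-end model would instead learn a single mapping from audio to words.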
Tara Sainath received her PhD in Electrical Engineering and Computer Science from MIT in 2009. The main focus of her PhD work was acoustic modeling for noise-robust speech recognition. After her PhD, she spent five years at the Speech and Language Algorithms group at IBM T.J. Watson Research Center, before joining Google Research. She has served as a Program Chair for ICLR in 2017 and 2018. Also, she has co-organized numerous special sessions and workshops, including Interspeech 2010, ICML 2013, Interspeech 2016 and ICML 2017. In addition, she is a member of the IEEE Speech and Language Processing Technical Committee (SLTC) as well as an Associate Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing. Her research interests are mainly in acoustic modeling, including deep neural networks, sparse representations and adaptation methods.
Discourse relations such as ‘contrast’, ‘cause’ or ‘evidence’ are often postulated to explain how humans understand the function of one sentence in relation to another. Some relations are signaled rather directly using words such as “because” or “on the other hand”, but often signals are highly ambiguous or remain implicit, and cannot be associated with specific words. This opens up questions regarding how exactly we recognize relations and what kinds of computational models we can build to account for them.
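The limits of matching on signal words can be seen in a minimal sketch. The cue lexicon below is hypothetical and far smaller than any real resource; the relation labels attached to each cue are illustrative assumptions, not annotations from the corpora named in this abstract.

```python
import re

# Hypothetical cue lexicon: each cue maps to the relations it might signal.
CUES = {
    "because": ["cause"],
    "on the other hand": ["contrast"],
    "since": ["cause", "circumstance"],  # assumed ambiguous cue
    "but": ["contrast", "concession"],   # assumed ambiguous cue
}


def candidate_relations(text):
    """Return {cue: possible relations} for every cue found as a whole word."""
    lowered = text.lower()
    found = {}
    for cue, relations in CUES.items():
        # Word boundaries keep "but" from matching inside "button".
        if re.search(r"\b" + re.escape(cue) + r"\b", lowered):
            found[cue] = relations
    return found
```

For "I stayed home because it rained." the lookup returns a single candidate, for "Since the storm, the road has been closed." it returns two competing candidates, and for "It rained. I stayed home." it returns nothing even though a reader infers a causal link.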
In this talk I will explore models capturing discourse signals in the framework of Rhetorical Structure Theory (Mann & Thompson 1988), using data from the RST Signaling Corpus (Taboada & Das 2013) and a richly annotated corpus called GUM (Zeldes 2017). Using manually annotated data indicating the presence of lexical and implicit signals, I will show that purely text-based models using RNNs and word embeddings inevitably miss important aspects of discourse structure. I will argue that richly annotated data beyond the textual level, including syntactic and semantic information, is required to form a more complete picture of discourse relations in text.
Amir Zeldes is assistant professor of Computational Linguistics at Georgetown University, specializing in Corpus Linguistics. He studied Cognitive Science, Linguistics and Computational Linguistics in Jerusalem, Potsdam, and Berlin, receiving his PhD in Linguistics from Humboldt University in 2012. His interests center on the syntax-semantics interface, where meaning and knowledge about the world are mapped onto language-specific choices. His most recent work focuses on computational discourse models which reflect common ground and communicative intent across sentences. He is also involved in the development of tools for corpus search, annotation and visualization, and has worked on representations of textual data in Linguistics and the Digital Humanities.