2020 JSALT Plenary Talks

Seventh Frederick Jelinek Memorial Summer Workshop

Fri. July 3rd, 10 AM to 11:15 AM

Speaker: Zhou Yu, UC Davis.

Title: Seamless Natural Communication between Humans and Machines

Location: Follow this link to watch a recording of the presentation.

Abstract: Dialog systems such as Alexa and Siri are everywhere in our lives. They can complete tasks such as booking flights, making restaurant reservations, and training people for interviews. However, currently deployed dialog systems are rule-based and cannot generalize to different domains, let alone support flexible dialog context tracking. We will first discuss how to design studies to collect realistic dialogs through a crowdsourcing platform. Then we introduce a dialog model that utilizes limited data to achieve good performance by leveraging multi-task learning and semantic scaffolds. We further improve the model’s coherence by tracking both semantic actions and conversational strategies from dialog history using finite-state transducers. Finally, we analyze some ethical concerns and human factors in dialog system deployment. All our work comes together to build seamless natural communication between humans and machines.

Bio: Zhou Yu is an Assistant Professor at the UC Davis Computer Science Department. Zhou will join the CS department at Columbia University in Jan 2021 as an Assistant Professor. She obtained her Ph.D. from Carnegie Mellon University in 2017. Zhou has built various dialog systems that have a real impact, such as a job interview training system, a depression screening system, and a second language learning system. Her research interests include dialog systems, language understanding and generation, vision and language, human-computer interaction, and social robots. Zhou received an ACL 2019 best paper nomination, was featured in the Forbes 2018 30 Under 30 in Science list, and won the 2018 Amazon Alexa Prize.

Fri. July 10th, 10 AM to 11:15 AM

Speaker: Jonathan Le Roux, Mitsubishi Electric Research Laboratories

Title: Deep Learning for Multifarious Speech Processing: Tackling Multiple Speakers, Microphones, and Languages

Location: Follow this link to watch a recording of the presentation.

Abstract: Speech processing has been at the forefront of the recent deep learning revolution, with major breakthroughs in automatic speech recognition, speech enhancement, and source separation. I will give an overview of deep learning techniques developed at MERL towards the goal of cracking the Tower of Babel version of the cocktail party problem, that is, separating and/or recognizing the speech of multiple unknown speakers speaking simultaneously in multiple languages, in both single-channel and multi-channel scenarios: from deep clustering to chimera networks, phasebook and friends, and from seamless ASR to MIMO-Speech and Transformer-based multi-speaker ASR.

Bio: Jonathan Le Roux is a Senior Principal Research Scientist and the Speech and Audio Senior Team Leader at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, Massachusetts. He completed his B.Sc. and M.Sc. degrees in Mathematics at the Ecole Normale Supérieure (Paris, France), his Ph.D. degree at the University of Tokyo (Japan) and the Université Pierre et Marie Curie (Paris, France), and worked as a postdoctoral researcher at NTT’s Communication Science Laboratories from 2009 to 2011. His research interests are in signal processing and machine learning applied to speech and audio. He has contributed to more than 100 peer-reviewed papers and 20 granted patents in these fields. He is a founder and chair of the Speech and Audio in the Northeast (SANE) series of workshops, and a Senior Member of the IEEE.

Fri. July 17th, 10 AM to 11 AM

Speaker: Ivan Medennikov, STC-innovations Ltd.

Title: Overlapping speech diarization: from clustering to Target-Speaker VAD

Location: Follow this link to watch a recording of the presentation.

Abstract: Speaker diarization for real-world conditions is a challenging problem, and the CHiME-6 dinner party scenario is an excellent example of such conditions. Spontaneous speech, distant microphones, and a large amount of speaker overlap make diarization extremely hard. A conventional clustering-based diarization pipeline is not able to solve this problem.

In this talk, we will consider a novel diarization approach named Target-Speaker Voice Activity Detection, which allowed the STC team to achieve state-of-the-art results in the CHiME-6 Challenge. In essence, TS-VAD is a fusion of several concepts, namely Personal VAD, End-to-end Neural Diarization, and Target-Speaker ASR. The main advantage of our approach is its inherent ability to tackle overlapping speech.

We will talk about the evolution of TS-VAD and discuss the limitations and prospects of the approach.

Bio: Ivan Medennikov is a Leading Researcher at STC-innovations Ltd (Saint-Petersburg, Russia). He completed his Ph.D. degree at Saint-Petersburg State University in 2016. His research focuses on various aspects of automatic speech recognition, primarily in far-field conditions. He leads a team of six researchers at STC-innovations and supervises two Ph.D. students at ITMO University.

Ivan’s team achieved 3rd place in the CHiME-5 Challenge (2018), 1st place in the VOiCES From a Distance Challenge (2019), and 1st place in the second track of the CHiME-6 Challenge (2020). He is also an author of the Target-Speaker Voice Activity Detection approach, which addresses the diarization problem in the CHiME-6 dinner party scenario.

Wed. July 22nd, 10 AM to 11 AM

Speaker: Naoyuki Kanda, Microsoft Research

Title: Joint Modeling Approach for Rich Transcription

Location: Follow this link to watch a recording of the presentation.

Abstract: Rich transcription of speech, which recognizes not only words but also various metadata such as speaker marks, has a long research history in automatic meeting analysis. While significant progress has been made in the field, most existing systems consist of multiple independent modules, which leads to suboptimal overall performance. In this talk, I will introduce recent efforts in joint modeling for rich transcription, from joint training of speech separation and speech recognition to our recent research on end-to-end speaker-attributed ASR, which includes speaker counting, speech separation, speech recognition, and speaker identification in one unified architecture.

Bio: Naoyuki Kanda is a Principal Researcher at Microsoft Research in Redmond, WA. His research interests include automatic speech recognition (ASR) and a wide range of spoken language technologies such as speaker diarization, spoken dialog systems and spoken document retrieval systems. His algorithms and systems for far-field multi-talker conversational ASR won the first prize at the IWSLT English evaluation campaign in 2014, and the second prize at the CHiME-5 speech recognition competition in 2018. He received a B.S. in Engineering, M.S. in Informatics, and Ph.D. in Informatics from Kyoto University, Japan, in 2004, 2006, and 2014, respectively. From 2006 to 2019, he served with Hitachi Ltd. in Tokyo, Japan, and held appointments as a Research Expert (2014-2016) and Cooperative Visiting Researcher (2016-2017) at the National Institute of Information and Communications Technology (NICT) in Kyoto, Japan. He is a member of the Institute of Electrical and Electronics Engineers (IEEE), the Acoustical Society of Japan (ASJ), the Japanese Society for Artificial Intelligence (JSAI), and the Information Processing Society of Japan (IPSJ).

Fri. July 24th, 10 AM to 11 AM

Speaker: Milica Gašić, Heinrich-Heine-Universität Düsseldorf

Title: 10 things you should know about dialogue

Location: Follow this link to watch a recording of the presentation.

Abstract: In recent years we have seen a surge of research concerning conversational AI, driven by the renaissance of deep learning and its sweeping results across the AI spectrum: vision, robotics, speech processing and NLP. In this talk I aim to emphasise the most important aspects of dialogue and achievements of dialogue research, focusing in particular on the research on statistical dialogue modelling preceding the deep learning boom. I will talk about the difference between task-oriented and chat-based systems, the importance of human-in-the-loop evaluation as well as concepts such as tracking and planning. I will review some of the key publications in the field and give an overview of important data-sets and toolkits.

Bio: Milica Gašić is a Professor of Dialogue Systems and Machine Learning at Heinrich Heine University Düsseldorf. Her research focuses on fundamental questions of human-computer dialogue modelling and lies at the intersection of Natural Language Processing and Machine Learning. Prior to her current position, she was a Lecturer in Spoken Dialog Systems at the Department of Engineering, University of Cambridge, where she led the Dialogue Systems Group. She received her PhD from the University of Cambridge, winning an EPSRC PhD Plus Award for her thesis titled Statistical Dialogue Modelling. She holds an MPhil degree in Computer Speech, Text and Internet Technology from the University of Cambridge, and a Diploma in Mathematics and Computer Science from the University of Belgrade. She is the recipient of several best paper awards: CSL (2010), Interspeech (2010), SLT (2010), SIGDIAL (2013), SIGDIAL (2015), EMNLP (2015), ACL (2016), EMNLP (2018) and SIGDIAL (2020).

Center for Language and Speech Processing