Chiori Hori (MERL): Multimodal Dialog Technologies Toward Human-Robot Communication
Natural spoken language interaction between humans and robots has been a long-standing dream of artificial intelligence. Recently, spoken dialog technologies have been applied in real-world human-machine interfaces including smartphone digital assistants, car navigation, voice-controlled speakers, and human-facing robots. Traditional dialog systems rely on hand-crafted rules to support a limited task domain, such as querying information from a database. In this talk, we introduce deep learning architectures that combine spoken dialog technologies and multimodal attention-based video description technologies to realize a novel Audio-Visual Scene-Aware Dialog (AVSD) framework. These models can generate unified semantic representations of natural language and audio-visual inputs, which facilitate flexible discourse about a scene. Our goal for AVSD is to identify and detail the events in a video through dialog. Experiments are conducted on dialogs for the Charades dataset, each consisting of 10 question-answer pairs and a summary; the dataset captures people performing everyday actions in real-world settings with natural audio. This work represents a key step toward real-world human-robot interaction and will be a focal point of the 7th Dialog System Technology Challenge (DSTC7).
Dr. Chiori Hori has worked on spoken language processing technologies since 1998. At NTT in 2002, she built a spoken interactive QA system using real-time automatic speech recognition (ASR) based on weighted finite-state transducers (WFSTs) with a vocabulary of over a million words. She joined CMU to work on speech summarization and translation in 2004 and then moved to ATR/NICT in 2007. From 2010 to 2014 at NICT, she led the U-STAR consortium, consisting of 30 research institutes from 25 countries/regions, to construct a network-based speech-to-speech translation system. She led the NICT ASR research group, whose system took first place in English TED talk recognition at IWSLT for three consecutive years from 2012. She invented a WFST-based dialog technology and implemented it in the humanoid robot ASIMO in 2014. She joined MERL in 2015 to work on neural-network-based technologies for human-robot communication.
This lecture is part of the closing day presentations for the 2018 Frederick Jelinek Memorial Summer Workshop.