Yuan Gong (MIT Computer Science and Artificial Intelligence Laboratory) “From Audio Perception to Understanding: A Path Towards Audio Artificial General Intelligence”

When:
April 1, 2024 @ 12:00 pm – 1:15 pm
Where:
Hackerman Hall B17
3400 N. Charles Street
Baltimore, MD 21218
Cost:
Free

Abstract

Our cognitive abilities enable us not only to perceive and identify speech and non-speech sounds but also to comprehend their meaning as a whole. While significant advances have been made in audio recognition in recent years, models trained with only sound labels have limited reasoning and understanding capabilities; for example, a model may recognize that a clock chimes six times but not know that this indicates it is six o'clock. Can we build an AI model that has both audio perception and reasoning abilities?

In this talk, I will first briefly introduce the advantages and limitations of the Audio Spectrogram Transformer (AST), a modern general audio perception model. Then I will dive deep into how we integrated an audio perception model with a text large language model (LLM) to build the first audio large language model, Listen, Think, and Understand (LTU). Finally, I will discuss how audio artificial general intelligence (audio AGI) could transform applications in robotics, health, and education in the future, and how we should manage the risks it might bring.

Bio

Yuan Gong is a research scientist in the Spoken Language Systems Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), working on audio, speech, and natural language processing. He received his Ph.D. in computer science from the University of Notre Dame, IN, USA, and his B.S. in biomedical engineering from Fudan University, Shanghai, China, in 2020 and 2015, respectively. He has published over 25 peer-reviewed papers at venues including Interspeech, ICASSP, EMNLP, NAACL, ICLR, AAAI, and ICCV. These include a paper nominated for the Best Student Paper Award at Interspeech 2019, a finalist for the Best Paper Award at ASRU 2023, the winning paper of the 2017 ACM Multimedia Audio-Visual Depression Detection Challenge, and a paper recognized among the notable top 25% of papers at ICLR 2023, which also received coverage by MIT News.

Center for Language and Speech Processing