Carlos Busso (University of Texas at Dallas) “Multimodal Machine Learning for Human-Centric Tasks”
3400 N. Charles Street
The almost unlimited multimedia content available on video-sharing websites has created new challenges and opportunities for building robust multimodal solutions. This seminar will describe our novel multimodal architectures that (1) are robust to missing modalities, (2) can identify noisy or less discriminative features, and (3) can leverage unlabeled data. First, we present a strategy that effectively combines auxiliary networks, a transformer architecture, and an optimized training mechanism for handling missing features. This problem is relevant since it is expected that during inference the multimodal system will face cases with missing features due to noise or occlusion. We implement this approach for audiovisual emotion recognition, achieving state-of-the-art performance. Second, we present a multimodal framework for dealing with scenarios characterized by noisy or less discriminative features. This situation is commonly observed in audiovisual automatic speech recognition (AV-ASR) with clean speech, where the performance often drops compared to a speech-only solution due to the variability of visual features. The proposed approach is a deep learning solution with a gating layer that diminishes the effect of noisy or uninformative visual features, keeping only useful information. The approach improves, or at least maintains, performance when visual features are used. Third, we discuss alternative strategies to leverage unlabeled multimodal data. A promising approach is to use multimodal pretext tasks that are carefully designed to learn better representations for a given task, leveraging the relationship between acoustic and facial features. Another approach is to use multimodal ladder networks, where intermediate representations are predicted across modalities using lateral connections. These models offer principled solutions to increase the generalization and robustness of common speech-processing tasks when using multimodal architectures.
Carlos Busso is a Professor in the Electrical and Computer Engineering Department at the University of Texas at Dallas, where he is also the director of the Multimodal Signal Processing (MSP) Laboratory. His research interest is in human-centered multimodal machine intelligence and applications, with a focus on the broad areas of affective computing, multimodal human-machine interfaces, in-vehicle active safety systems, and machine learning methods for multimodal processing. He has worked on audio-visual emotion recognition, analysis of emotional modulation in gestures and speech, designing realistic human-like virtual characters, and detection of driver distractions. He is a recipient of an NSF CAREER Award. In 2014, he received the ICMI Ten-Year Technical Impact Award. In 2015, his student received the third prize IEEE ITSS Best Dissertation Award (N. Li). He also received the Hewlett Packard Best Paper Award at the IEEE ICME 2011 (with J. Jain), and the Best Paper Award at the AAAC ACII 2017 (with Yannakakis and Cowie). He received the Best of IEEE Transactions on Affective Computing Paper Collection in 2021 (with R. Lotfian) and the Best Paper Award from IEEE Transactions on Affective Computing in 2022 (with Yannakakis and Cowie). He received the ACM ICMI Community Service Award in 2023. In 2023, he received the Distinguished Alumni Award in the Mid-Career/Academia category from the Signal and Image Processing Institute (SIPI) at the University of Southern California. He is currently serving as an associate editor of the IEEE Transactions on Affective Computing. He is an IEEE Fellow, a member of ISCA and AAAC, and a senior member of ACM.