Abstract
Voice conversion (VC) is a significant topic in artificial intelligence. It studies how to convert one speaker’s voice to sound like that of another without changing the linguistic content. Voice conversion belongs to the general technical field of speech synthesis, which converts text to speech or changes properties of speech such as voice identity, emotion, and accent. Voice conversion involves multiple speech processing techniques, such as speech analysis, spectral conversion, prosody conversion, speaker characterization, and vocoding. With recent advances in theory and practice, we are now able to produce human-like voice quality with high speaker similarity. In this talk, Dr. Sisman will present the recent advances in voice conversion and discuss their promise and limitations. She will also provide a summary of the available resources for expressive voice conversion research.
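To make the analysis–conversion–vocoding pipeline mentioned above concrete, here is a minimal sketch using librosa. The `convert_spectrum` function is a hypothetical placeholder for a trained conversion model, and the Griffin-Lim vocoder stands in for the neural vocoders used in practice; none of this reflects Dr. Sisman’s specific methods.

```python
# Minimal sketch of an analysis -> conversion -> vocoding pipeline, assuming
# librosa is installed. `convert_spectrum` is a hypothetical placeholder for a
# trained spectral/prosody conversion model, not any specific published method.
import librosa
import numpy as np

def convert_spectrum(mel: np.ndarray) -> np.ndarray:
    """Placeholder: a real system would map source-speaker features to
    target-speaker features with a learned model."""
    return mel  # identity mapping, for illustration only

# 1. Speech analysis: waveform -> mel spectrogram
y, sr = librosa.load(librosa.ex("trumpet"), sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# 2. Spectral conversion (here a no-op stand-in)
mel_converted = convert_spectrum(mel)

# 3. Vocoding: mel spectrogram -> waveform (Griffin-Lim; neural vocoders
#    such as HiFi-GAN would give much higher quality)
y_out = librosa.feature.inverse.mel_to_audio(mel_converted, sr=sr, n_fft=1024, hop_length=256)
```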
Biography
Dr. Berrak Sisman (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from the National University of Singapore in 2020, fully funded by the A*STAR Graduate Academy under the Singapore International Graduate Award (SINGA). She is currently a tenure-track Assistant Professor in the Department of Electrical and Computer Engineering, Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas, United States. Prior to joining UT Dallas, she was a faculty member at the Singapore University of Technology and Design (2020-2022) and a Postdoctoral Research Fellow at the National University of Singapore (2019-2020). She was an exchange doctoral student at the University of Edinburgh and a visiting scholar at The Centre for Speech Technology Research (CSTR), University of Edinburgh (2019), and a visiting researcher at the RIKEN Advanced Intelligence Project in Japan (2018). Her research focuses on machine learning, signal processing, emotion, speech synthesis, and voice conversion.
Dr. Sisman has served as an Area Chair for INTERSPEECH 2021, INTERSPEECH 2022, and IEEE SLT 2022, and as the Publication Chair for ICASSP 2022. She was elected as a member of the IEEE Speech and Language Processing Technical Committee (SLTC) in the area of Speech Synthesis for the term from January 2022 to December 2024. She plays leadership roles in conference organization and is active in technical committees. She has served as the General Coordinator of the Student Advisory Committee (SAC) of the International Speech Communication Association (ISCA).
Abstract
Driven by the goal of eradicating language barriers on a global scale, machine translation has solidified itself as a key focus of artificial intelligence research today. However, such efforts have coalesced around a small subset of languages, leaving behind the vast majority of mostly low-resource languages. What does it take to break the 200-language barrier while ensuring safe, high-quality results and keeping ethical considerations in mind? In this talk, I introduce No Language Left Behind, an initiative to break language barriers for low-resource languages. In No Language Left Behind, we took on the low-resource language translation challenge by first contextualizing the need for translation support through exploratory interviews with native speakers. Then, we created datasets and models aimed at narrowing the performance gap between low- and high-resource languages. We proposed multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Critically, we evaluated the performance of over 40,000 different translation directions using a human-translated benchmark, Flores-200, and combined human evaluation with a novel toxicity benchmark covering all languages in Flores-200 to assess translation safety. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art, laying important groundwork towards realizing a universal translation system in an open-source manner.
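As a rough illustration of how corpus-level BLEU for a single translation direction might be scored, here is a hedged sketch using sacrebleu. The hypothesis and reference sentences are placeholders, not Flores-200 data, and this is not the evaluation pipeline used in the work itself.

```python
# Illustrative sketch: scoring one translation direction with corpus-level BLEU
# using sacrebleu. The sentences below are placeholders, not Flores-200 data.
import sacrebleu

hypotheses = [
    "The cat sits on the mat.",
    "She is reading a book in the garden.",
]
references = [
    "The cat is sitting on the mat.",
    "She reads a book in the garden.",
]

# sacrebleu expects a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```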
Biography
Angela is a research scientist at Meta AI Research in New York, focusing on supporting efforts in speech and language research. Recent projects include No Language Left Behind (https://ai.facebook.com/research/no-language-left-behind/) and Universal Speech Translation for Unwritten Languages (https://ai.facebook.com/blog/ai-translation-hokkien/). Before working on translation, Angela focused on research in on-device models for NLP and computer vision, as well as text generation.
Abstract
Multilingual machine translation has proven immensely useful for both parameter efficiency and overall performance for many language pairs via complete parameter sharing. However, some language pairs in multilingual models can see worse performance than in bilingual models, especially in the one-to-many translation setting. Motivated by their empirical differences, we examine the geometric differences in representations from bilingual models versus those from one-to-many multilingual models. Specifically, we measure the isotropy of these representations using intrinsic dimensionality and IsoScore, in order to measure how these representations utilize the dimensions in their underlying vector space. We find that for a given language pair, its multilingual model decoder representations are consistently less isotropic than comparable bilingual model decoder representations. Additionally, we show that much of this anisotropy in multilingual decoder representations can be attributed to modeling language-specific information, therefore limiting remaining representational capacity.
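The following is a simplified proxy for the isotropy measures mentioned above, based on the eigenvalue spectrum of the covariance of decoder hidden states. It is not the exact IsoScore or intrinsic-dimensionality computation used in the work, only an illustration of what "using the dimensions of the vector space evenly" means.

```python
# Simplified illustration of measuring how uniformly a set of decoder hidden
# states uses the dimensions of its vector space. This is a proxy based on the
# covariance eigenvalue spectrum, not the exact IsoScore or intrinsic
# dimensionality estimators used in the paper.
import numpy as np

def isotropy_proxy(hidden_states: np.ndarray) -> float:
    """hidden_states: (num_tokens, hidden_dim) array of decoder representations.
    Returns a value in (0, 1]; values near 1 mean variance is spread evenly
    over all dimensions (isotropic), values near 0 mean a few directions dominate."""
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)
    p = eigvals / eigvals.sum()
    # Normalized entropy of the eigenvalue distribution ("effective rank" style).
    return float(np.exp(-(p * np.log(p)).sum()) / len(p))

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(1000, 64))                          # variance in all directions
anisotropic = rng.normal(size=(1000, 64)) * np.linspace(5.0, 0.05, 64)
print(isotropy_proxy(isotropic))    # close to 1
print(isotropy_proxy(anisotropic))  # noticeably smaller
```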
Abstract
In this talk, I will present a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.
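Below is a minimal PyTorch-style sketch of the high-masking-ratio step described above: the spectrogram is cut into patches, a large fraction of patches is dropped at random, and only the kept tokens would be fed through the encoder. The patch size, embedding dimension, and 0.8 masking ratio are illustrative assumptions, not the exact Audio-MAE configuration.

```python
# Minimal sketch of the masking step: patchify a spectrogram, randomly keep a
# small fraction of patches, and pass only those tokens to the encoder.
# Patch size and the 0.8 masking ratio are illustrative assumptions.
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float = 0.8):
    """tokens: (batch, num_patches, dim). Returns the kept tokens, their indices,
    and the permutation needed to restore the original order in the decoder."""
    b, n, d = tokens.shape
    len_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n)                       # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)      # patches with low scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return kept, ids_keep, ids_restore

# Example: a batch of 128-mel x 1024-frame spectrograms cut into 16x16 patches,
# then linearly embedded (a single Linear layer as a stand-in for patch embedding).
spec = torch.randn(4, 1, 128, 1024)
patches = spec.unfold(2, 16, 16).unfold(3, 16, 16)   # (4, 1, 8, 64, 16, 16)
patches = patches.reshape(4, -1, 16 * 16)            # (4, 512, 256)
embed = torch.nn.Linear(256, 768)
tokens = embed(patches)                               # (4, 512, 768)
kept, ids_keep, ids_restore = random_mask(tokens)     # kept: (4, 102, 768)
# Only `kept` goes through the encoder; the decoder later re-inserts mask tokens
# at the positions given by `ids_restore` to reconstruct the input spectrogram.
```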
Bio
Florian Metze is a Research Scientist Manager at Meta AI in New York, supporting a team of researchers and engineers working on multi-modal (image, video, audio, text) content understanding for Meta’s Family of Apps (Instagram, Threads, Facebook, WhatsApp). He was previously an Associate Research Professor at Carnegie Mellon University, in the School of Computer Science’s Language Technologies Institute, where he remains an Adjunct Professor. He is also a co-founder of Abridge, a company working on extracting information from doctor-patient conversations. His work covers many areas of speech recognition and multi-media analysis, with a focus on end-to-end deep learning. Currently, he focuses on multi-modal processing of videos and on using that information to recommend unconnected content. In the past, he has worked on low-resource and multi-lingual speech processing, speech recognition with articulatory features, large-scale multi-media retrieval and summarization, information extraction from medical interviews, and recognition of personality and similar meta-data from speech.
For more information, please see http://www.cs.cmu.edu/directory/fmetze
Abstract
The almost unlimited multimedia content available on video-sharing websites has opened new challenges and opportunities for building robust multimodal solutions. This seminar will describe our novel multimodal architectures that (1) are robust to missing modalities, (2) can identify noisy or less discriminative features, and (3) can leverage unlabeled data. First, we present a strategy that effectively combines auxiliary networks, a transformer architecture, and an optimized training mechanism for handling missing features. This problem is relevant since, during inference, the multimodal system is expected to face cases with missing features due to noise or occlusion. We implement this approach for audiovisual emotion recognition, achieving state-of-the-art performance. Second, we present a multimodal framework for dealing with scenarios characterized by noisy or less discriminative features. This situation is commonly observed in audiovisual automatic speech recognition (AV-ASR) with clean speech, where performance often drops compared to a speech-only solution due to the variability of visual features. The proposed approach is a deep learning solution with a gating layer that diminishes the effect of noisy or uninformative visual features, keeping only useful information. The approach improves, or at least maintains, performance when visual features are used. Third, we discuss alternative strategies to leverage unlabeled multimodal data. A promising approach is to use multimodal pretext tasks that are carefully designed to learn better representations for predicting a given task, leveraging the relationship between acoustic and facial features. Another approach is using multimodal ladder networks, where intermediate representations are predicted across modalities using lateral connections. These models offer principled solutions to increase the generalization and robustness of common speech-processing tasks when using multimodal architectures.
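The following is a hedged sketch of the kind of gating layer described above for AV-ASR: a sigmoid gate computed from both modalities scales the visual stream so that uninformative visual features can be attenuated before fusion. The feature dimensions and concatenation-based fusion are assumptions for illustration, not the exact published architecture.

```python
# Sketch of a gating layer that scales visual features before fusion, in the
# spirit of the approach described above. Feature sizes and the concatenation
# fusion are illustrative assumptions, not the exact published architecture.
import torch
import torch.nn as nn

class GatedAudioVisualFusion(nn.Module):
    def __init__(self, audio_dim: int = 256, visual_dim: int = 128):
        super().__init__()
        # The gate looks at both modalities to decide how much of the visual
        # stream to trust at each frame.
        self.gate = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, visual_dim),
            nn.Sigmoid(),
        )

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (batch, time, audio_dim), visual: (batch, time, visual_dim)
        g = self.gate(torch.cat([audio, visual], dim=-1))  # gate values in (0, 1)
        gated_visual = g * visual                          # down-weight noisy visual cues
        return torch.cat([audio, gated_visual], dim=-1)    # fused features for the back-end

fusion = GatedAudioVisualFusion()
fused = fusion(torch.randn(2, 50, 256), torch.randn(2, 50, 128))
print(fused.shape)  # torch.Size([2, 50, 384])
```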
Bio
Carlos Busso is a Professor in the Department of Electrical and Computer Engineering at The University of Texas at Dallas, where he is also the director of the Multimodal Signal Processing (MSP) Laboratory. His research interest is in human-centered multimodal machine intelligence and applications, with a focus on the broad areas of affective computing, multimodal human-machine interfaces, in-vehicle active safety systems, and machine learning methods for multimodal processing. He has worked on audio-visual emotion recognition, analysis of emotional modulation in gestures and speech, designing realistic human-like virtual characters, and detection of driver distractions. He is a recipient of an NSF CAREER Award. In 2014, he received the ICMI Ten-Year Technical Impact Award. In 2015, his student received the third-prize IEEE ITSS Best Dissertation Award (N. Li). He also received the Hewlett Packard Best Paper Award at IEEE ICME 2011 (with J. Jain) and the Best Paper Award at AAAC ACII 2017 (with Yannakakis and Cowie). He received the Best of IEEE Transactions on Affective Computing Paper Collection award in 2021 (with R. Lotfian) and the Best Paper Award from IEEE Transactions on Affective Computing in 2022 (with Yannakakis and Cowie). He received the ACM ICMI Community Service Award in 2023. Also in 2023, he received the Distinguished Alumni Award in the Mid-Career/Academia category from the Signal and Image Processing Institute (SIPI) at the University of Southern California. He is currently serving as an associate editor of the IEEE Transactions on Affective Computing. He is an IEEE Fellow, a member of ISCA and AAAC, and a senior member of the ACM.