Model robustness and spurious correlations have received increasing attention in the NLP community, both in methods and evaluation. The term “spurious correlation” is overloaded, however, and can refer to any undesirable shortcut learned by the model, as judged by domain experts.
When designing mitigation algorithms, we often (implicitly) assume that a spurious feature is irrelevant for prediction. However, many features in NLP (e.g., word overlap and negation) are not spurious in the sense that the background is spurious for classifying objects in an image. On the contrary, they carry important information that humans need to make predictions. In this talk, we argue that it is more productive to characterize features in terms of their necessity and sufficiency for prediction. We then discuss the implications of this categorization for representation, learning, and evaluation.
He He is an Assistant Professor in the Department of Computer Science and the Center for Data Science at New York University. She obtained her PhD in Computer Science at the University of Maryland, College Park. Before joining NYU, she spent a year at AWS AI and was a post-doc at Stanford University before that. She is interested in building robust and trustworthy NLP systems in human-centered settings. Her recent research focuses on robust language understanding, collaborative text generation, and understanding the capabilities and issues of large language models.
Modern learning architectures for natural language processing have been very successful in incorporating huge amounts of text into their parameters. However, by and large, such models store and use knowledge in distributed and decentralized ways. This proves unreliable and makes the models ill-suited for knowledge-intensive tasks that require reasoning over factual information in linguistic expressions. In this talk, I will give a few examples of exploring alternative architectures to tackle those challenges. In particular, we can improve the performance of such (language) models by representing, storing, and accessing knowledge in a dedicated memory component.
This talk is based on several joint works with Yury Zemlyanskiy (Google Research), Michiel de Jong (USC and Google Research), William Cohen (Google Research and CMU) and our other collaborators in Google Research.
Fei is a research scientist at Google Research. Before that, he was a Professor of Computer Science at the University of Southern California. His primary research interests are machine learning and its application to various AI problems: speech and language processing, computer vision, robotics, and, more recently, weather forecasting and climate modeling. He has a PhD (2007) in Computer and Information Science from the University of Pennsylvania and B.Sc. and M.Sc. degrees in Biomedical Engineering from Southeast University (Nanjing, China).
Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high-fidelity text or image outputs, but also demonstrate impressive domain and task generalization capabilities. In contrast, audio generative models are relatively primitive in scale and generalization.
In this talk, I will start with a brief introduction to conventional neural speech generative models and discuss why they are unfit for scaling to Internet-scale data. Next, by reviewing the latest large-scale generative models for text and image, I will outline a few promising approaches to building scalable speech models. Last, I will present Voicebox, our latest work to advance this area. Voicebox is the most versatile generative model for speech. It is trained on a simple task, text-conditioned speech infilling, using over 50K hours of multilingual speech and a powerful flow-matching objective. Through in-context learning, Voicebox can perform monolingual and cross-lingual zero-shot TTS, holistic style conversion, transient noise removal, content editing, and diverse sample generation. Moreover, Voicebox achieves state-of-the-art performance and excellent run-time efficiency.
Wei-Ning Hsu is a research scientist at Meta Foundational AI Research (FAIR) and currently the lead of the audio generation team. His research focuses on self-supervised learning and generative models for speech and audio. His pioneering work includes HuBERT, AV-HuBERT, TextlessNLP, data2vec, wav2vec-U, textless speech translation, and Voicebox.
Prior to joining Meta, Wei-Ning worked at MERL and Google Brain as a research intern. He received his Ph.D. and S.M. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2020 and 2018, respectively, under the supervision of Dr. James Glass. He received his B.S. degree in Electrical Engineering from National Taiwan University in 2014, under the supervision of Prof. Lin-shan Lee and Prof. Hsuan-Tien Lin.