Our goal is to use AI to automatically find tax minimization strategies, an approach we call “Shelter Check.” It would come in two variants. Existing-Authority Shelter Check would aim to find whether existing tax law authorities can be combined to create tax minimization strategies, so the IRS or Congress can shut them down. New-Authority Shelter Check would automate checking whether a new tax law authority – like proposed legislation or a draft court decision – would combine with existing authorities to create a new tax minimization strategy. We initially had high hopes for GPT-* large language models for implementing Shelter Check, but our tests have shown that they do very poorly at basic legal reasoning and at handling legal text. So we are now creating a benchmark and training data for LLMs’ handling of legal text, hoping to spur improvements.
Large-scale generative models such as GPT and DALL-E have revolutionized natural language processing and computer vision research. These models not only generate high fidelity text or image outputs, but also demonstrate impressive domain and task generalization capabilities. In contrast, audio generative models are relatively primitive in scale and generalization.
In this talk, I will start with a brief introduction on conventional neural speech generative models and discuss why they are unfit for scaling to Internet-scale data. Next, by reviewing the latest large-scale generative models for text and image, I will outline a few lines of promising approaches to build scalable speech models. Last, I will present Voicebox, our latest work to advance this area. Voicebox is the most versatile generative model for speech. It is trained with a simple task — text conditioned speech infilling — on over 50K hours of multilingual speech with a powerful flow-matching objective. Through in-context learning, Voicebox can perform monolingual/cross-lingual zero-shot TTS, holistic style conversion, transient noise removal, content editing, and diverse sample generation. Moreover, Voicebox achieves state-of-the-art performance and excellent run-time efficiency.
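The flow-matching objective mentioned above can be illustrated with a minimal numpy sketch. This assumes the simplest linear probability path between noise and data, and it omits the text and audio-context conditioning that Voicebox actually uses — it is an illustration of the general technique, not the model's training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Flow matching with a linear path: x_t = (1 - t) * x0 + t * x1,
# whose target velocity field is u = x1 - x0. A network v(x_t, t, cond)
# is trained by regression to u; conditioning is omitted in this sketch.
x1 = rng.standard_normal((4, 80))   # toy batch of mel-spectrogram frames (data)
x0 = rng.standard_normal((4, 80))   # noise samples
t = rng.uniform(size=(4, 1))        # one random time per example

x_t = (1 - t) * x0 + t * x1         # point on the path at time t
u_target = x1 - x0                  # regression target for the network

def fm_loss(v_pred, u):
    """Mean-squared error between predicted and target velocities."""
    return np.mean((v_pred - u) ** 2)
```

At sampling time, one would integrate the learned velocity field from noise (t = 0) to data (t = 1) with an ODE solver, which is part of what makes this family of models efficient at run time.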
Wei-Ning Hsu is a research scientist at Meta Foundational AI Research (FAIR) and currently the lead of the audio generation team. His research focuses on self-supervised learning and generative models for speech and audio. His pioneering work includes HuBERT, AV-HuBERT, TextlessNLP, data2vec, wav2vec-U, textless speech translation, and Voicebox.
Prior to joining Meta, Wei-Ning worked at MERL and Google Brain as a research intern. He received his Ph.D. and S.M. degrees in Electrical Engineering and Computer Science from Massachusetts Institute of Technology in 2020 and 2018, under the supervision of Dr. James Glass. He received his B.S. degree in Electrical Engineering from National Taiwan University in 2014, under the supervision of Prof. Lin-shan Lee and Prof. Hsuan-Tien Lin.
Recent advances in speech technology make heavy use of pre-trained models that learn from large quantities of raw (untranscribed) speech, using “self-supervised” (i.e., unsupervised) learning. These models learn to transform the acoustic input into a different representational format that makes supervised learning (for tasks such as transcription or even translation) much easier. However, *what* speech-relevant information is encoded in these representations, and *how*, is not well understood. I will talk about some work at various stages of completion in which my group is analyzing the structure of these representations, to gain a more systematic understanding of how word-level, phonetic, and speaker information is encoded.
Sharon Goldwater is a Professor in the Institute for Language, Cognition and Computation at the University of Edinburgh’s School of Informatics. She received her PhD in 2007 from Brown University and spent two years as a postdoctoral researcher at Stanford University before moving to Edinburgh. Her research interests include unsupervised and minimally-supervised learning for speech and language processing, computer modelling of language acquisition in children, and computational studies of language use. Her main focus within linguistics has been on the lower levels of structure, including phonetics, phonology, and morphology.

Prof. Goldwater has received awards including the 2016 Roger Needham Award from the British Computer Society for “distinguished research contribution in computer science by a UK-based researcher who has completed up to 10 years of post-doctoral research.” She has served on the editorial boards of several journals, including Computational Linguistics, Transactions of the Association for Computational Linguistics, and the inaugural board of OPEN MIND: Advances in Cognitive Science. She was a program chair for the EACL 2014 Conference and chaired the EACL governing board from 2019 to 2020.
In this talk, I will present a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training.
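The patchify-mask-encode step described above can be sketched in a few lines of numpy. The spectrogram shape, patch size, and 0.8 masking ratio here are illustrative assumptions rather than the paper's exact configuration, and the encoder/decoder networks themselves are elided:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectrogram": 64 time frames x 128 mel bins, split into 16x16 patches.
spec = rng.standard_normal((64, 128))
patch = 16
patches = spec.reshape(64 // patch, patch, 128 // patch, patch).transpose(0, 2, 1, 3)
patches = patches.reshape(-1, patch * patch)   # (32 patches, 256 dims each)

# High masking ratio: only the kept (non-masked) patches go through the encoder,
# which is what makes pre-training cheap.
mask_ratio = 0.8
n = patches.shape[0]
n_keep = int(n * (1 - mask_ratio))
perm = rng.permutation(n)
keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
encoder_input = patches[keep_idx]              # small subset seen by the encoder

# The decoder re-orders the encoded context, pads the masked positions with a
# (learned) mask token, and reconstructs the full spectrogram from this sequence.
mask_token = np.zeros(patch * patch)           # stand-in for a learned embedding
decoder_input = np.empty_like(patches)
decoder_input[keep_idx] = encoder_input        # stand-in for encoder outputs
decoder_input[mask_idx] = mask_token
```

The local window attention in the decoder, and the lower-ratio fine-tuning stage, would sit on top of this skeleton; the sketch only shows how the token sequences for encoder and decoder are formed.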
Florian Metze is a Research Scientist Manager at Meta AI in New York, supporting a team of researchers and engineers working on multi-modal (image, video, audio, text) content understanding for Meta’s Family of Apps (Instagram, Threads, Facebook, WhatsApp). He was previously an Associate Research Professor at Carnegie Mellon University, in the School of Computer Science’s Language Technologies Institute, where he remains an Adjunct Professor. He is also a co-founder of Abridge, a company working on extracting information from doctor-patient conversations. His work covers many areas of speech recognition and multi-media analysis, with a focus on end-to-end deep learning. Currently, he focuses on multi-modal processing of videos, and on using that information to recommend unconnected content. In the past, he has worked on low-resource and multi-lingual speech processing, speech recognition with articulatory features, large-scale multi-media retrieval and summarization, information extraction from medical interviews, and recognition of personality and similar meta-data from speech.
For more information, please see http://www.cs.cmu.edu/directory/fmetze