Abstract
Kelly’s research spans three broad directions in multilingual NLP and representation learning: (1) diagnosing and fixing failure modes in translation technologies, (2) data-efficient and low-resource NLP, and (3) compute-efficient NLP. This talk gives an overview of five years of PhD work, spanning projects on unsupervised machine translation and bilingual lexicon induction, the mathematical framing of translation tasks, and the efficient adaptation of large language models to new languages. Kelly will also discuss future research directions, including multi-modal representation learning, compression, speech translation, and sign-language translation.
Abstract
How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. First, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual information between temporal modulations and frame-wise phoneme labels. From the other perspective, we ask which speech modulations an Automatic Speech Recognition (ASR) system prefers for its operation: data-driven weights are learned over the modulation spectrum and optimized for an end-to-end ASR task. Both methods agree that speech information is mostly contained in slow modulations. Mutual information peaks around 3–6 Hz, which is also the range of modulations most preferred by the ASR. In addition, we show that incorporating this knowledge into ASR systems significantly reduces their dependency on the amount of training data.
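The first perspective above — measuring how informative a modulation band is about phoneme identity — can be illustrated with a plug-in mutual-information estimate over paired discrete observations. The band names, labels, and binning below are invented for illustration; the actual study computes MI over the real modulation spectrum with frame-wise phoneme labels.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Toy data: a "slow" band whose (binned) energy tracks the phoneme labels,
# and a "fast" band whose energy is unrelated to them.
labels    = ["ah", "ah", "s", "s", "ah", "s", "ah", "s"]
slow_band = ["hi", "hi", "lo", "lo", "hi", "lo", "hi", "lo"]  # correlated
fast_band = ["hi", "lo", "hi", "lo", "lo", "hi", "lo", "hi"]  # uncorrelated

print(mutual_information(slow_band, labels))  # 1.0 bit: band predicts phoneme
print(mutual_information(fast_band, labels))  # close to 0
```

In the study itself this estimate, swept across modulation frequencies, is what localizes the information peak in the 3–6 Hz range.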
Abstract
Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text often varies with the length of the input. We propose a solution called Nugget, which encodes language into a representation based on a dynamically selected subset of input tokens. These nuggets are learned through tasks like autoencoding and machine translation, and intuitively segment language into meaningful units. We demonstrate that Nugget outperforms related approaches on tasks involving semantic comparison. Finally, we illustrate that these compact units allow expanding the context window of a language model (LM), suggesting future LMs that can condition on significantly larger amounts of content.
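The core idea — a representation whose size scales with input length, built from a dynamically selected subset of token encodings — can be sketched as a length-proportional hard top-k selection. The `ratio` parameter and the hard selection are assumptions for illustration; the actual method learns the scorer end-to-end through autoencoding and translation objectives.

```python
import numpy as np

def select_nuggets(token_encodings, scores, ratio=0.25):
    """Keep a length-proportional subset of token encodings as 'nuggets'.

    token_encodings: (seq_len, dim) array of contextual token vectors.
    scores: (seq_len,) importance scores (learned in the real model;
            supplied as input in this sketch).
    """
    k = max(1, int(round(len(scores) * ratio)))
    idx = np.sort(np.argsort(scores)[-k:])  # top-k positions, kept in order
    return token_encodings[idx], idx

rng = np.random.default_rng(0)
enc = rng.normal(size=(12, 4))    # 12 tokens, dim-4 encodings
scores = rng.normal(size=12)      # stand-in for learned importance scores
nuggets, idx = select_nuggets(enc, scores, ratio=0.25)
print(nuggets.shape, idx)         # 3 of 12 encodings survive, in token order
```

Because the number of nuggets grows with sequence length, longer inputs get proportionally larger representations, which is what the constant-size baselines cannot do.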
Abstract
Valuable NLP datasets have traditionally shipped with crowdsourced categorical labels: instructions for collecting them are easy to communicate, and the labels themselves are easy to annotate. However, as self-supervised methods improve across nearly every task, human annotations may need to provide more nuanced supervision, or enable more detailed evaluation, to remain worth collecting. One natural extension of existing categorical annotation schemes is to obtain uncertainty information beyond a single hard label. In this talk, I will discuss my recent efforts to introduce scalar labels in place of categorical labels as a form of uncertainty annotation. We demonstrate that, compared to other, more obvious annotation schemes for eliciting uncertainty information, scalar labels are significantly more cost-effective to annotate, provide reliable evaluation, and have a theoretical connection to existing predictive uncertainty metrics. In particular, they motivate using other losses as surrogates for calibration evaluation.
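One way to make the final point concrete: with scalar labels in [0, 1], a squared-error (Brier-style) loss between predicted probabilities and the human scores can stand in for a calibration check. The pairing of this particular loss with scalar labels follows the talk's framing, but the specific metric and data below are illustrative, not the work's actual evaluation.

```python
def brier_surrogate(probs, scalar_labels):
    """Mean squared error between predicted probabilities and scalar human
    labels in [0, 1] -- a Brier-style surrogate for calibration evaluation."""
    assert len(probs) == len(scalar_labels)
    return sum((p - y) ** 2 for p, y in zip(probs, scalar_labels)) / len(probs)

# A model whose confidences track human uncertainty should score lower
# than one that always outputs a confident hard prediction.
human      = [0.9, 0.2, 0.6, 0.5]    # scalar uncertainty annotations
soft_model = [0.85, 0.25, 0.55, 0.5] # tracks human uncertainty
hard_model = [1.0, 0.0, 1.0, 1.0]    # confident hard labels only

print(brier_surrogate(soft_model, human))  # small
print(brier_surrogate(hard_model, human))  # larger
```

Unlike binned calibration metrics such as ECE, this surrogate needs no binning of predictions, which is one reason a proper scoring loss over scalar labels is attractive as an evaluation target.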