Seminars

CLSP Student Seminar – Kelly Marchisio – “Efficient Multilingual NLP” @ Hackerman Hall B17
Mon, Jan 23 @ 12:00 pm – 1:15 pm

Abstract

Kelly’s research spans three broad directions in multilingual NLP and representation learning: (1) diagnosing and fixing failure modes in translation technologies, (2) data-efficient and low-resource NLP, and (3) compute-efficient NLP. This talk is an overview of five years of PhD work, spanning projects on unsupervised machine translation and bilingual lexicon induction, the mathematical framing of translation tasks, and efficient adaptation of large language models to new languages. Kelly will also discuss future research directions, including multi-modal representation learning, compression, speech translation, and sign-language translation.

CLSP Student Seminar @ Hackerman Hall B17
Fri, Jan 27 @ 12:00 pm – 1:15 pm

CLSP Student Seminar @ Hackerman Hall B17
Mon, Feb 13 @ 12:00 pm – 1:15 pm

Student Seminar – Desh Raj @ Hackerman Hall B17
Mon, Mar 27 @ 12:00 pm – 1:15 pm

Student Seminar – Samik Sadhu (JHU) “Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives” @ Hackerman Hall B17
Mon, Apr 3 @ 12:00 pm – 1:15 pm

Abstract

How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. First, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual information between temporal modulations and frame-wise phoneme labels. From the other perspective, we ask which speech modulations an Automatic Speech Recognition (ASR) system prefers for its operation: data-driven weights are learned over the modulation spectrum and optimized for an end-to-end ASR task. Both methods agree that speech information is mostly contained in slow modulations. Maximum mutual information occurs around 3–6 Hz, which is also the range of modulations most preferred by the ASR. In addition, we show that incorporating this knowledge into ASRs significantly reduces their dependence on the amount of training data.
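
To make the first analysis concrete, here is a minimal sketch of estimating, band by band, how much information modulation-spectrum features carry about frame-wise phoneme labels. It uses scikit-learn’s mutual_info_classif; the array shapes, band count, and random stand-in data are illustrative assumptions, not the talk’s actual setup.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Illustrative stand-ins: 10,000 frames, 16 modulation-rate bands, and
# frame-wise labels from a 40-phoneme inventory (real inputs would come from
# a modulation-spectrum analysis of speech plus forced-alignment labels).
rng = np.random.default_rng(0)
n_frames, n_bands, n_phones = 10_000, 16, 40
modulation_features = rng.normal(size=(n_frames, n_bands))
phoneme_labels = rng.integers(0, n_phones, size=n_frames)

# Estimate I(band_k ; phoneme) independently for each modulation band.
mi_per_band = mutual_info_classif(modulation_features, phoneme_labels, random_state=0)

# Rank bands by how much phonetic information they carry; on real speech the
# talk reports a peak around 3-6 Hz.
for band, mi in sorted(enumerate(mi_per_band), key=lambda t: -t[1]):
    print(f"band {band:2d}: MI estimate {mi:.4f}")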

Student Seminar – Ruizhe Huang @ Hackerman Hall B17
Mon, Apr 10 @ 12:00 pm – 1:15 pm

Student Seminar – Brian Lu @ Hackerman Hall B17
Mon, Apr 24 @ 12:00 pm – 1:15 pm

Student Seminar – Guanghui Qin “Nugget: Neural Agglomerative Embeddings of Text (ICML 2023)” @ Hackerman Hall B17
Mon, Sep 11 @ 12:00 pm – 1:15 pm

Abstract

Embedding text sequences is a widespread requirement in modern language understanding. Existing approaches focus largely on constant-size representations. This is problematic, as the amount of information contained in text often varies with the length of the input. We propose a solution called Nugget, which encodes language into a representation based on a dynamically selected subset of input tokens. These nuggets are learned through tasks like autoencoding and machine translation, and intuitively segment language into meaningful units. We demonstrate that Nugget outperforms related approaches on tasks involving semantic comparison. Finally, we illustrate that these compact units allow for expanding the contextual window of a language model (LM), suggesting future LMs that can condition on significantly larger amounts of content.
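
As a rough illustration of the central idea (not the paper’s actual architecture), the sketch below scores each contextual token embedding and keeps a length-proportional top-k subset as the sequence representation, so longer inputs get more “nuggets.” The scorer, ratio, and shapes are hypothetical.

import torch

def select_nuggets(token_embeddings: torch.Tensor, ratio: float = 0.1):
    """Score each token and keep a length-proportional subset.

    token_embeddings: (seq_len, dim) contextual embeddings from an encoder.
    Returns the (k, dim) selected "nugget" embeddings and their positions.
    """
    seq_len, dim = token_embeddings.shape
    k = max(1, int(ratio * seq_len))               # representation size grows with input length

    # Hypothetical learned scorer; an untrained linear layer here, for illustration only.
    scorer = torch.nn.Linear(dim, 1)
    scores = scorer(token_embeddings).squeeze(-1)  # (seq_len,)

    positions = torch.topk(scores, k).indices.sort().values  # keep original token order
    return token_embeddings[positions], positions

# Usage with dummy encoder outputs for a 50-token input.
embeddings = torch.randn(50, 768)
nuggets, positions = select_nuggets(embeddings, ratio=0.1)
print(nuggets.shape, positions.tolist())           # torch.Size([5, 768]) plus 5 token positions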

CLSP Student Seminar – Zhengping Jiang “Scalar Labels for Capturing Human Uncertainty” @ Hackerman Hall B17
Fri, Sep 29 @ 12:00 pm – 1:15 pm

Abstract

NLP datasets have traditionally shipped with crowdsourced categorical labels. Instructions for collecting these labels are easy to communicate, and the labels themselves are easy to annotate. However, as self-supervision-based methods get better at nearly everything, human annotations may need to provide more nuanced supervision or enable more detailed evaluation to remain worth collecting. One natural extension of existing categorical annotation schemes is to obtain uncertainty information beyond a single hard label. In this talk, I will discuss my recent efforts to introduce scalar labels in place of categorical labels as a form of uncertainty annotation. We demonstrate that, compared to other more obvious annotation schemes for eliciting uncertainty information, scalar labels are significantly more cost-effective to annotate, provide reliable evaluation, and have a theoretical connection to existing predictive uncertainty metrics. In particular, they motivate using other losses as surrogates for calibration evaluation.
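
As one hedged example of what a surrogate evaluation against scalar labels could look like (the metric, names, and numbers below are illustrative, not the talk’s exact formulation), a model’s predicted probability can be scored directly against the human scalar judgement instead of against a collapsed hard label:

import numpy as np

def scalar_label_surrogate(pred_probs: np.ndarray, scalar_labels: np.ndarray) -> float:
    """Mean squared error between predicted probabilities and human scalar
    judgements in [0, 1]: a Brier-style surrogate that uses the full
    uncertainty annotation rather than a single hard label."""
    return float(np.mean((pred_probs - scalar_labels) ** 2))

# Example: three NLI-style items, with model entailment probabilities and
# crowd scalar judgements of how likely the hypothesis is given the premise.
pred_probs = np.array([0.9, 0.2, 0.55])
scalar_labels = np.array([0.8, 0.1, 0.7])
print(scalar_label_surrogate(pred_probs, scalar_labels))  # ~0.0142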

CLSP Student Seminar – Anna Favaro @ Hackerman Hall B17
Mon, Oct 2 @ 12:00 pm – 1:15 pm
