Model robustness and spurious correlations have received increasing attention in the NLP community, both in methods and evaluation. However, the term “spurious correlation” is overloaded and can refer to any undesirable shortcut learned by the model, as judged by domain experts.
When designing mitigation algorithms, we often (implicitly) assume that a spurious feature is irrelevant for prediction. However, many features in NLP (e.g., word overlap and negation) are not spurious in the sense that the background is spurious for classifying objects in an image. On the contrary, they carry important information that humans need in order to make predictions. In this talk, we argue that it is more productive to characterize features in terms of their necessity and sufficiency for prediction. We then discuss the implications of this categorization for representation, learning, and evaluation.
He He is an Assistant Professor in the Department of Computer Science and the Center for Data Science at New York University. She obtained her PhD in Computer Science at the University of Maryland, College Park. Before joining NYU, she spent a year at AWS AI and was a post-doc at Stanford University before that. She is interested in building robust and trustworthy NLP systems in human-centered settings. Her recent research focus includes robust language understanding, collaborative text generation, and understanding capabilities and issues of large language models.
Transformers are essential to pretraining. As we approach 5 years of BERT, the connection between attention as architecture and transfer learning remains key to this central thread in NLP. Other architectures such as CNNs and RNNs have been used to replicate pretraining results, but these either fail to reach the same accuracy or require supplemental attention layers. This work revisits the seminal BERT result and considers pretraining without attention. We consider replacing self-attention layers with recently developed approaches for long-range sequence modeling and transformer architecture variants. Specifically, inspired by recent papers like the structured state space sequence model (S4), we use simple routing layers based on state-space models (SSM) and a bidirectional model architecture based on multiplicative gating. We discuss the results of the proposed Bidirectional Gated SSM (BiGS) and present a range of analyses of its properties. Results show that architecture does seem to have a notable impact on downstream performance, and that the model exhibits a different inductive bias that is worth exploring further.
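To make the idea of SSM-based routing with multiplicative gating more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the BiGS layer from the talk: the diagonal recurrence, the GELU activations, the summation of the forward and backward passes, and all names (`diagonal_ssm`, `bidirectional_gated_ssm`, `init_params`, `W_v`, `W_u`, `W_o`) are hypothetical choices made for this example.

```python
import numpy as np

def diagonal_ssm(u, a, b, c):
    """Causal diagonal linear state-space recurrence (illustrative, not S4 itself).

    u: (seq_len, d) inputs; a, b, c: (d, n) per-channel parameters, with |a| < 1.
    State update: x_k = a * x_{k-1} + b * u_k; output: y_k = sum over n of c * x_k.
    """
    seq_len, d = u.shape
    x = np.zeros_like(a)
    y = np.empty((seq_len, d))
    for k in range(seq_len):
        x = a * x + b * u[k][:, None]
        y[k] = (c * x).sum(axis=1)
    return y

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def bidirectional_gated_ssm(u, params):
    """Route a sequence through forward and backward SSMs, then gate the result."""
    v = gelu(u @ params["W_v"])  # gating branch (no sequence mixing)
    h = gelu(u @ params["W_u"])  # branch that is mixed along the sequence
    fwd = diagonal_ssm(h, *params["ssm_fwd"])
    bwd = diagonal_ssm(h[::-1], *params["ssm_bwd"])[::-1]  # run on the reversed sequence
    # Assumption: directions are summed here; other combinations are possible.
    return (v * (fwd + bwd)) @ params["W_o"]  # multiplicative gating, then output projection

def init_params(d, n, rng):
    """Hypothetical random initialization, for demonstration only."""
    def ssm():
        return (rng.uniform(0.5, 0.95, size=(d, n)),  # a: stable decay rates
                rng.normal(size=(d, n)),              # b
                rng.normal(size=(d, n)) / n)          # c
    def proj():
        return rng.normal(size=(d, d)) / np.sqrt(d)
    return {"W_v": proj(), "W_u": proj(), "W_o": proj(),
            "ssm_fwd": ssm(), "ssm_bwd": ssm()}

rng = np.random.default_rng(0)
params = init_params(d=4, n=3, rng=rng)
tokens = rng.normal(size=(6, 4))  # stand-in for 6 token embeddings of width 4
out = bidirectional_gated_ssm(tokens, params)
print(out.shape)  # (6, 4)
```

Each `diagonal_ssm` call is causal on its own, so running a second pass over the reversed sequence is what lets every position see both left and right context without attention. The explicit loop is used for readability only; practical SSM implementations typically compute the same recurrence as a long convolution or a parallel scan.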