Matt Gardner (Allen Institute for Artificial Intelligence) “NLP Evaluations that We Believe In”
3400 N. Charles Street
With all of the modeling advancements in recent years, NLP benchmarks have been falling over left and right: “human performance” has been reached on SQuAD 1 and 2, GLUE and SuperGLUE, and many commonsense datasets. Yet no serious researcher actually believes that these systems understand language, or even that they really solve the tasks underlying these datasets. To get benchmarks that we actually believe in, we need both to think more deeply about the language phenomena that our benchmarks target and to make our evaluation sets more rigorous. I will first present ORB, an Open Reading Benchmark that collects many reading comprehension datasets that we (and others) have recently built, targeting various aspects of what it means to read. I will then present contrast sets, a way of creating non-i.i.d. test sets that more thoroughly evaluate a model’s abilities on some task, decoupling training data artifacts from test labels.
Matt is a senior research scientist at the Allen Institute for AI on the AllenNLP team. His research focuses primarily on getting computers to read and answer questions, dealing both with open-domain reading comprehension and with understanding question semantics in terms of some formal grounding (semantic parsing). He is particularly interested in cases where these two problems intersect, doing some kind of reasoning over open-domain text. He is the original architect of the AllenNLP toolkit, and he co-hosts the NLP Highlights podcast with Waleed Ammar and Pradeep Dasigi.