Paper that asks the right questions earns Shijie Wu and Mark Dredze a Best Long Paper Award

July 23, 2020

A paper co-written by Shijie Wu, a PhD candidate in the Center for Language and Speech Processing, and Department of Computer Science professor Mark Dredze recently received the Best Long Paper Award at the Workshop on Representation Learning for Natural Language Processing, which took place virtually on July 9.

Wu referred to winning the award as a surprise, especially because the primary focus of the paper is on asking the right questions.

“I think asking the right question is as important as solving a question,” Wu said. “While some might think our field focuses more and more on climbing whatever the next hill, I am glad to find that people still appreciate finding which hill to climb.”

The paper, “Are All Languages Created Equal in Multilingual BERT?”, focuses on Bidirectional Encoder Representations from Transformers (BERT), a natural language processing model that Google created in 2018. The first iteration of BERT was pretrained to fill in random missing words in a sentence, much like a cloze test, by reading through the entirety of Wikipedia’s English articles, as well as a number of books.

After the success of the first BERT, Google created a second, multilingual version that was pretrained on Wikipedia articles in 104 languages. This more agile version can encode words and models in one language and then, with no input from humans, directly apply that model to other languages, greatly increasing NLP technology’s capability.

Wu and Dredze ran a series of experiments to ascertain whether Google’s multilingual BERT could learn low-resource languages (those without much source material) as effectively as it could languages that had a large number of resources for the model to pretrain on. The duo discovered that the multilingual BERT struggled with languages in the bottom 30 percent (measured in terms of number of resources) more than did a non-BERT model, such as one based on word embeddings like FastText.

“We show that pretraining alone is not the silver bullet for low-resource languages,” Wu said. “This work sheds light on the limitation of the popular multilingual BERT. If you want to use BERT for low-resource languages, you will need to either collect more text to make it a high-resource language, or come up with more data-efficient methods.”

Wu hopes that this research will lead NLP researchers to value lower-resource languages in their work. He notes that, with more than 6,000 languages spoken around the world and the Bible translated into more than 2,500 of them, there should be a greater emphasis on lesser-spoken languages among NLP researchers.

“I think NLP technology should be able to serve any language in the world,” Wu said. “I start by looking at recent progress in NLP through the lens of low-resource languages, which are still the top 100 languages in the world in terms of resources. For these languages, the recent progress in NLP seems less significant.”

Follow this link to read the paper.

