Improving Translation of Informal Language

2019 Sixth Frederick Jelinek Memorial Summer Workshop

Natural Language Processing (NLP) has made significant strides in a number of applications such as Machine Translation, Question Answering, and Text Classification. NLP applications, however, still have difficulty dealing with “non-standard” text. For example, Google Translate was recently reported to “hallucinate” religious prophecies when the word “dog” was typed 19 times [14]. Compared to traditional sources such as news corpora, user-generated text brings a set of unique challenges. These include the use of informal language (e.g. hashtags and slang) as well as a long tail of variations such as misspellings, typos, and creative use of language. We need to teach NLP systems to deal with these diverse linguistic phenomena in order to perform well under typical operating conditions [12].

In the specific case of neural machine translation (NMT), recent work has claimed to achieve near-human performance [3, 10]. These results, however, cover translation in formal domains such as news articles. When evaluated under noisy or adversarial conditions, these state-of-the-art systems often fail [6, 11], which suggests that seq2seq models have intrinsic weaknesses [12, 13]. In order to use translation to facilitate conversations that cross language barriers, we need to address these challenges. Applications where translation systems can be used successfully for informal text include messaging applications (Messenger, WhatsApp, iMessage), content sharing on social media (Facebook, Instagram, Twitter), and discussion forums (Reddit) [1].

Proposed Task: Machine Translation of Informal Language

We will focus on developing machine translation systems that address the following research questions.

  • How can we make MT systems robust to lexical variations due to social media style (for example, the use of “dunno” or “idk” instead of “I do not know”) or to dialectal input (for example, “colour” vs. “color”)?
  • How can we make MT systems robust to spelling variations such as typos, phonological variations (for example, “mayb” instead of “maybe”), and other non-standard spellings?
  • Can we take into account additional context of the input? Rich context includes sociolinguistic features such as author attributes and non-linguistic features such as emojis and hashtags. In addition, for conversational use cases such as a thread of messages or comments, can we make MT systems leverage more of the context in the thread?
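
To make these questions concrete, the sketch below shows two naive preprocessing baselines in Python. It is purely illustrative and is not a method proposed for the workshop: the lookup table, the function names, the `<sep>` marker, and the `max_prev` parameter are all assumptions introduced here. The first function rewrites known informal tokens to standard forms using the examples from the list above; the second prepends earlier messages from a thread to the message being translated.

```python
import re

# Hypothetical lookup table, filled only with the examples from the list above.
INFORMAL_TO_STANDARD = {
    "dunno": "I do not know",
    "idk": "I do not know",
    "colour": "color",
    "mayb": "maybe",
}

# Split into word runs, punctuation runs, and whitespace runs so that
# joining the pieces reproduces the input exactly.
TOKEN = re.compile(r"\w+|[^\w\s]+|\s+")


def normalize(text, table=INFORMAL_TO_STANDARD):
    """Replace known informal tokens (case-insensitively); keep everything else."""
    return "".join(table.get(tok.lower(), tok) for tok in TOKEN.findall(text))


def with_thread_context(messages, sep=" <sep> ", max_prev=2):
    """Prepend up to max_prev earlier messages to the last message.

    Assumes a non-empty list of strings, oldest message first.
    """
    *previous, current = messages
    context = previous[-max_prev:] if max_prev > 0 else []
    return sep.join(context + [current])


if __name__ == "__main__":
    print(normalize("idk, mayb later"))
    print(with_thread_context(["where r u", "home", "idk, mayb later"]))
```

A static table of this kind cannot cover the long tail of misspellings and creative spellings, and it discards the social meaning carried by the informal form, which is part of why the questions above call for robust encoders and context-aware decoders rather than preprocessing alone.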

We will research new methods to address these challenges during the workshop. This will require modeling and algorithmic expertise in building encoders that are robust to language variations and decoders that can condition on context. Building relevant datasets prior to the workshop will require crowdsourcing and annotation expertise to obtain translations that capture the social meaning of the original text. Finally, dialogue modeling expertise will allow us to track changes in dialogue state and determine how they might affect the translation.

Expected Outcomes

As part of the workshop, we expect to deliver: (i) novel methods for dealing with informal, noisy text to produce more accurate translations, and (ii) novel methods for utilizing context when translating, specifically when translating informal conversational text.



Team Leader

Dmitriy Genzel (Facebook)

Senior Members

Xian Li  (Facebook)
Jackie Cheung (McGill, Canada)
Jia Xu (City U. of New York)

Graduate Students

Yue Dong (McGill, Canada)
Pippa Shoemark (Edinburgh, UK)
Abdul Rafae Khan (City U. of New York)
Paul Michel (Carnegie Mellon U.)
Elizabeth Salesky (Johns Hopkins U.)

Senior Affiliates (Part-time Members)

James Cross (Facebook)
Graham Neubig (Carnegie Mellon U.)
Mahsa Mohammadi (Johns Hopkins U.)
Hairong Liu (Baidu, China)
Jacob Eisenstein (Georgia Tech/FB)
Lucia Specia (Imperial College London/FB)
Philipp Koehn (Johns Hopkins U.)
Tobias Domhan (Amazon)
Ves Stoyanov (Facebook)
Yi-Hisu Liao (Apple)



Center for Language and Speech Processing