Modeling Lexically Divergent Paraphrases in Twitter and Beyond – Wei Xu (University of Pennsylvania)

February 24, 2015 (all day)

Why is it so difficult for computers to understand and generate natural language? The main challenge arises from the fact that human language is both rich and ambiguous. One way to handle this richness and ambiguity is to learn paraphrases, that is, language expressions that are worded differently but have (nearly) equivalent meanings, from massive amounts of data. First, I will present a novel multi-instance learning model that captures a wide range of paraphrases from Twitter’s data stream, including synonyms, acronyms, misspellings, slang and colloquialisms (e.g. “has been sacked by” and “gets the boot from”, or “oscar nom’d doc” and “Oscar-nominated documentary”). I will highlight how paraphrases can be used to adapt statistical machine translation techniques to text-to-text generation tasks such as text simplification and stylistic rewriting. Second, I will describe how similar models can be used for distantly supervised information extraction, which leverages large knowledge bases instead of human-labeled text data during learning.
Joint work with Chris Callison-Burch (UPenn), Ralph Grishman (NYU), Alan Ritter (OSU), Bill Dolan (MSR), Raphael Hoffmann (AI2), Joel Tetreault (Yahoo!), Le Zhao (Google), Martin Chodorow (CUNY), Yangfeng Ji (GaTech), Colin Cherry (NRC).

Wei Xu is a postdoctoral researcher in the Computer and Information Science Department at the University of Pennsylvania. Her research focuses on paraphrases, social media and information extraction. She received her PhD in Computer Science from New York University. During her PhD, she visited the University of Washington for two years and interned at Microsoft Research, ETS and […]. She is organizing the SemEval-2015 shared task on Paraphrase and Semantic Similarity in Twitter and the ACL Workshop on Noisy User-generated Text.

Center for Language and Speech Processing