Shane Bergsma
   Postdoctoral Fellow
   On Twitter: @ShaneBergsma

    Johns Hopkins University
Center for Language and Speech Processing
Human Language Technology Center of Excellence
Natural Sciences and Engineering Research Council of Canada
Postdoctoral Fellowship Recipient


I have moved to the University of Saskatchewan. My new homepage is:


Research: Natural Language Processing (NLP)
    The field of natural language processing (also known as computational linguistics) aims to develop computer systems that are capable of understanding and responding to human language. NLP technology provides the brains behind IBM's Jeopardy-playing Watson system. NLP also provides core algorithms for systems that perform spelling correction, speech recognition, and automatic translation; such systems are used by millions of people every day. My research in NLP is based on the idea that robust processing of human language requires knowledge beyond what is in small human-annotated data sets. I therefore explore ways to extract meaningful statistics from huge amounts of raw text, and I use these statistics to create intelligent language-processing systems. I also look at ways to extract information from large-scale bilingual and visual data, and I apply this information to linguistic problems. Techniques from machine learning play a central role in my work; machine learning provides principled ways to combine linguistic intuitions with evidence from real-world data. See my publications below for more details.


Presentation Materials:
    You can find slides for most of my conference presentations below with the corresponding publication. I also provide here the presentation materials for some recent invited talks and other presentations:
  • Better Together: Large Monolingual, Bilingual and Multimodal Corpora in Natural Language Processing, 2011 talks at Cambridge University and the University of Pennsylvania (intended for an NLP audience). Slides in [pptx] [ppt] [pdf].
  • Three kinds of web data that can help computers make better sense of human language, Fall 2011 talks at York University, the University of Saskatchewan, and Stony Brook University (intended for a general Computer Science audience). Slides in [pptx] [ppt] [pdf].
  • Coreference Resolution using Web-Scale Statistics, most recently a Fall 2011 lecture at Stony Brook University (intended for an NLP audience). Slides in [pptx] [ppt] [pdf].

JHU Research Workshops:

Data/Code:

  • Software Projects:
    1. ArcFilter: An efficient program that vastly speeds up arc-based dependency parsing. It filters arcs from the dependency graph before parsing begins. Used in our recent COLING and ACL papers. [@GoogleCode]
    2. NADA: A robust program for detecting non-referential (a.k.a. pleonastic, expletive, dummy) pronouns. It takes tokenized English sentences as input and finds occurrences of the word 'it'. When an 'it' is found, the system outputs a probability for whether the 'it' is a referential instance, or instead a non-referential pronoun. Described in our DAARC 2011 paper. [@GoogleCode]
    3. Carmen: A Twitter Geolocation System. "Given a tweet, Carmen will return Location objects that represent a physical location. Carmen uses both coordinates and other information in a tweet to make geolocation decisions. It's not perfect, but this greatly increases the number of geolocated tweets over what Twitter provides." Described in our HIAI paper. [@GitHub]

  • Generally Useful NLP Data:
    1. Noun Gender and Number Data for Coreference Resolution. This is my most widely used data set, and one of the standard resources in the Closed Task for the CoNLL 2011 Shared Task on Modeling Unrestricted Coreference in OntoNotes. Your coreference system should probably make use of it too! [GenderData]
    2. Distributional Clustering of Phrases: A clustering of a huge number of phrases from Google N-grams. [Clusters]

  • Training and Evaluation Code/Data:
    1. *Manually-Annotated Data for Language Identification in Twitter, along with a Python-based language-ID system. [Tweets]
    2. *Manually-Segmented Search Engine Queries and Feature Data. This query data has become a standard evaluation set for Information Retrieval research. [Queries]
    3. Annotated and processed ACL articles used in our work on Stylometric Analysis of Scientific Articles. [Labeled ACL Papers]
    4. Evaluation code and data for Learning Bilingual Lexicons from the visual similarity of Web Images. [Visual Lexicon Materials]
    5. Evaluation code and data for our Coordination Disambiguation project. [Coordination Materials]
    6. Evaluation code and data for our Visual Selectional Preference project. [Visual Selectional Preference Materials]
    7. Evaluation data for our Robust Supervised Classifiers project. [Robust Data]
    8. It-Bank: An online repository of labelled instances of the pronoun "it". [It-Bank]
    9. American National Corpus articles with Annotated Anaphora Resolutions. [Annotated Anaphora Data]
    10. Evaluation data used in our Alignment-Based Discriminative String Similarity project. [Cognates]
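
To illustrate the input/output behaviour described for NADA above (tokenized English sentences in, one probability per occurrence of 'it' out), here is a minimal Python sketch. It is not NADA's code, model, or interface: the function names, the cue-word list, and the two fixed probability values are all hypothetical stand-ins chosen only to show the shape of the task. A real system would replace the toy scoring rule with a trained classifier.

```python
from typing import List, Tuple

# Hypothetical cue words that often follow a non-referential "it".
# This list is an illustrative assumption, not taken from NADA.
NON_REFERENTIAL_CUES = {
    "seems", "appears", "rains", "raining", "snows", "snowing",
    "likely", "possible", "important", "clear",
}


def find_it_tokens(tokens: List[str]) -> List[int]:
    """Return the positions of every token equal to 'it' (case-insensitive)."""
    return [i for i, tok in enumerate(tokens) if tok.lower() == "it"]


def toy_referential_probability(tokens: List[str], index: int, window: int = 3) -> float:
    """Toy stand-in for a trained model: return a low probability of being
    referential if a cue word appears within `window` tokens after 'it'."""
    following = {t.lower() for t in tokens[index + 1 : index + 1 + window]}
    return 0.2 if following & NON_REFERENTIAL_CUES else 0.8


def score_sentence(sentence: str) -> List[Tuple[int, float]]:
    """Take one whitespace-tokenized sentence and return
    (token position, probability that this 'it' is referential) pairs."""
    tokens = sentence.split()
    return [(i, toy_referential_probability(tokens, i)) for i in find_it_tokens(tokens)]


if __name__ == "__main__":
    for position, prob in score_sentence("It is raining , but I left it at home ."):
        print(position, prob)
```

In this sketch the first 'it' (followed by "raining") receives the low score and the second (an ordinary pronoun) receives the high score; the specific numbers carry no meaning beyond the illustration.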



Dr. Shane Bergsma
Center for Language and Speech Processing
Johns Hopkins University
3400 N. Charles Street, CSEB 226E
Baltimore, MD 21218-2691 U.S.A.