Publications



2014

The American Local News Corpus
Ann Irvine, Joshua Langfus and Chris Callison-Burch
Proceedings of the Language Resources and Evaluation Conference (LREC) – 2014

[abstract] [bib]

Abstract

We present the American Local News Corpus (ALNC), containing over 4 billion words of text from 2,652 online newspapers in the United States. Each article in the corpus is associated with a timestamp, state, and city. All 50 U.S. states and 1,924 cities are represented. We detail our method for taking daily snapshots of thousands of local and national newspapers and present two example corpus analyses. The first explores how different sports are talked about over time and geography. The second compares per capita murder rates with news coverage of murders across the 50 states. The ALNC is about the same size as the Gigaword corpus and is growing continuously. Version 1.0 is available for research use.
@InProceedings{irvine-etal-lrec14,
author = {Irvine, Ann and Joshua Langfus and Callison-Burch, Chris},
title = {The American Local News Corpus},
booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
url = {http://www.cs.jhu.edu/~anni/papers/alnc_lrec14.pdf}
}

The Language Demographics of Amazon Mechanical Turk
Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev and Chris Callison-Burch
Transactions of the Association for Computational Linguistics (TACL) – 2014

[abstract] [bib]

Abstract

We present a large scale study of the languages spoken by bilingual workers on Mechanical Turk (MTurk). We establish a methodology for determining the language skills of anonymous crowd workers that is more robust than simple surveying. We validate workers' self-reported language skill claims by measuring their ability to correctly translate words, and by geolocating workers to see if they reside in countries where the languages are likely to be spoken. Rather than posting a one-off survey, we posted paid tasks consisting of 1,000 assignments to translate a total of 10,000 words in each of 100 languages. Our study ran for several months, and was highly visible on the MTurk crowdsourcing platform, increasing the chances that bilingual workers would complete it. Our study was useful both to create bilingual dictionaries and to act as a census of the bilingual speakers on MTurk. We use this data to recommend languages with the largest speaker populations as good candidates for other researchers who want to develop crowdsourced, multilingual technologies. To further demonstrate the value of creating data via crowdsourcing, we hire workers to create bilingual parallel corpora in six Indian languages, and use them to train statistical machine translation systems.
@article{Pavlick-EtAl-2014,
author = {Ellie Pavlick and Post, Matt and Irvine, Ann and Dmitry Kachaev and Callison-Burch, Chris},
title = {The Language Demographics of Amazon Mechanical Turk},
journal = {Transactions of the Association for Computational Linguistics (TACL)},
publisher = {Association for Computational Linguistics},
url = {http://cs.jhu.edu/~ccb/publications/language-demographics-of-mechanical-turk.pdf}
}

Improving Deep Neural Network Acoustic Models Using Generalized Maxout Networks
Xiaohui Zhang, Jan Trmal, Daniel Povey and Sanjeev Khudanpur
ICASSP 2014 – 2014

[abstract] [bib]

Abstract

Recently, maxout networks have brought significant improvements to various speech recognition and computer vision tasks. In this paper we introduce two new types of generalized maxout units, which we call p-norm and soft-maxout. We investigate their performance in Large Vocabulary Continuous Speech Recognition (LVCSR) tasks in various languages with 10 hours and 60 hours of data, and find that the p-norm generalization of maxout consistently performs well. Because, in our training setup, we sometimes see instability during training when training unbounded-output nonlinearities such as these, we also present a method to control that instability. This is the "normalization layer", which is a nonlinearity that scales down all dimensions of its input in order to stop the average squared output from exceeding one. The performance of our proposed nonlinearities is compared with maxout, rectified linear units (ReLU), tanh units, and also with a discriminatively trained SGMM/HMM system, and our p-norm units with p equal to 2 are found to perform best.
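
A sketch of the units described above, in notation of our own choosing (the paper gives the exact formulation): a p-norm unit groups k inputs and outputs their p-norm, and one way to realize the normalization layer as described is to rescale a d-dimensional input only when its average squared value exceeds one:

  y \;=\; \Big(\sum_{i=1}^{k} |x_i|^p\Big)^{1/p},
  \qquad
  \mathrm{norm}(x)_i \;=\; \frac{x_i}{\max\!\Big(1,\ \sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^2}\Big)}
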
@inproceedings{Zhan1405:Improving,
author = {Xiaohui Zhang and Jan Trmal and Povey, Daniel and Khudanpur, Sanjeev},
title = {Improving Deep Neural Network Acoustic Models Using Generalized Maxout Networks},
booktitle = {ICASSP 2014},
address = {Florence, Italy},
url = {http://www.danielpovey.com/files/2014_icassp_dnn.pdf}
}


2013

Improved speech-to-text translation with the Fisher and Callhome Translated Corpus of Spanish-English Speech
Matt Post, Gaurav Kumar, Adam Lopez, Damianos Karakos, Chris Callison-Burch and Sanjeev Khudanpur
Proceedings of the International Workshop on Spoken Language Translation (IWSLT) – 2013

[abstract] [bib]

Abstract

Research into the translation of the output of automatic speech recognition (ASR) systems is hindered by the dearth of datasets developed for that explicit purpose. For Spanish-English translation, in particular, most parallel data available exists only in vastly different domains and registers. In order to support research on cross-lingual speech applications, we introduce the Fisher and Callhome Spanish-English Speech Translation Corpus, supplementing existing LDC audio and transcripts with (a) ASR 1-best, lattice, and oracle output produced by the Kaldi recognition system and (b) English translations obtained on Amazon's Mechanical Turk. The result is a four-way parallel dataset of Spanish audio, transcriptions, ASR lattices, and English translations of approximately 38 hours of speech, with defined training, development, and held-out test sets. We conduct baseline machine translation experiments using models trained on the provided training data, and validate the dataset by corroborating a number of known results in the field, including the utility of in-domain (informal, conversational) training data, increased performance translating lattices (instead of recognizer 1-best output), and the relationship between word error rate and BLEU score.
@InProceedings{post-improved-2013,
author = {Post, Matt and Kumar, Gaurav and Lopez, Adam and Karakos, Damianos and Callison-Burch, Chris and Khudanpur, Sanjeev},
title = {Improved speech-to-text translation with the Fisher and Callhome Translated Corpus of Spanish-English Speech},
booktitle = {Proceedings of the International Workshop on Spoken Language Translation (IWSLT)}
}

Monolingual Marginal Matching for Translation Model Adaptation
Ann Irvine, Chris Quirk and Hal Daume III
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) – 2013

[abstract] [bib]

Abstract

When using a machine translation (MT) model trained on OLD-domain parallel data to translate NEW-domain text, one major challenge is the large number of out-of-vocabulary and new-translation-sense words. We present a method to identify new translations of both known and unknown source language words that uses NEW-domain comparable document pairs. Starting with a joint distribution of source-target word pairs derived from the OLD-domain parallel corpus, our method recovers a new joint distribution that matches the marginal distributions of the NEW-domain comparable document pairs, while minimizing the divergence from the OLD-domain distribution. Adding these learned translations to our French-English MT model results in gains of about 2 BLEU points over strong baselines.
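
One natural formalization of the recovery step described above (our notation; a sketch rather than the paper's exact objective): given the OLD-domain joint p(s,t) and NEW-domain marginals \hat{q}(s) and \hat{q}(t) estimated from the comparable document pairs, choose the joint closest to p that matches those marginals:

  q^{*} \;=\; \arg\min_{q}\ D\big(q \,\|\, p\big)
  \quad \text{s.t.} \quad
  \sum_{t} q(s,t) = \hat{q}(s)\ \ \forall s,
  \qquad
  \sum_{s} q(s,t) = \hat{q}(t)\ \ \forall t
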
@InProceedings{irvineQuirkDaumeEMNLP13,
author = {Irvine, Ann and Chris Quirk and Hal Daume III},
title = {Monolingual Marginal Matching for Translation Model Adaptation},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
url = {http://www.aclweb.org/anthology/D/D13/D13-1109.pdf}
}

Measuring Machine Translation Errors in New Domains
Ann Irvine, John Morgan, Marine Carpuat, Hal Daume III and Dragos Munteanu
Transactions of the Association for Computational Linguistics (TACL) – 2013

[abstract] [bib]

Abstract

We develop two techniques for analyzing the effect of porting a machine translation system to a new domain. One is a macro-level analysis that measures how domain shift affects corpus-level evaluation; the second is a micro-level analysis for word-level errors. We apply these methods to understand what happens when a Parliament-trained phrase-based machine translation system is applied in four very different domains: news, medical texts, scientific articles and movie subtitles. We present quantitative and qualitative experiments that highlight opportunities for future research in domain adaptation for machine translation.
@article{mtDomainErrors_TACL:2013,
author = {Irvine, Ann and John Morgan and Marine Carpuat and Hal Daume III and Dragos Munteanu},
title = {Measuring Machine Translation Errors in New Domains},
journal = {Transactions of the Association for Computational Linguistics (TACL)},
url = {https://aclweb.org/anthology/Q/Q13/Q13-1035.pdf}
}

SenseSpotting: Never let your parallel data tie you to an old domain
Marine Carpuat, Hal Daume III, Katharine Henry, Ann Irvine, Jagadeesh Jagarlamudi and Rachel Rudinger
Proceedings of the Association for Computational Linguistics (ACL) – 2013

[abstract] [bib]

Abstract

Words often gain new senses in new domains. Being able to automatically identify, from a corpus of monolingual text, which word tokens are being used in a previously unseen sense has applications to machine translation and other tasks sensitive to lexical semantics. We define a task, SENSESPOTTING, in which we build systems to spot tokens that have new senses in new domain text. Instead of difficult and expensive annotation, we build a gold-standard by leveraging cheaply available parallel corpora, targeting our approach to the problem of domain adaptation for machine translation. Our system is able to achieve F-measures of as much as 80%, when applied to word types it has never seen before. Our approach is based on a large set of novel features that capture varied aspects of how words change when used in new domains.
@InProceedings{sensespotting13,
author = {Marine Carpuat and Hal Daume III and Katharine Henry and Irvine, Ann and Jagadeesh Jagarlamudi and Rachel Rudinger},
title = {SenseSpotting: Never let your parallel data tie you to an old domain},
booktitle = {Proceedings of the Association for Computational Linguistics (ACL)},
url = {http://www.aclweb.org/anthology/P/P13/P13-1141.pdf}
}

Combining Bilingual and Comparable Corpora for Low Resource Machine Translation
Ann Irvine and Chris Callison-Burch
Proceedings of the ACL Workshop on Statistical Machine Translation (WMT) – 2013

[abstract] [bib]

Abstract

Statistical machine translation (SMT) performance suffers when models are trained on only small amounts of parallel data. The learned models typically have both low accuracy (incorrect translations and feature scores) and low coverage (high out-of-vocabulary rates). In this work, we use an additional data resource, comparable corpora, to improve both. Beginning with a small bitext and corresponding phrase-based SMT model, we improve coverage by using bilingual lexicon induction techniques to learn new translations from comparable corpora. Then, we supplement the model's feature space with translation scores estimated over comparable corpora in order to improve accuracy. We observe improvements between 0.5 and 1.7 BLEU translating Tamil, Telugu, Bengali, Malayalam, Hindi, and Urdu into English.
@inProceedings{irvineCallisonBurchWMT13,
author = {Irvine, Ann and Callison-Burch, Chris},
title = {Combining Bilingual and Comparable Corpora for Low Resource Machine Translation},
booktitle = {Proceedings of the ACL Workshop on Statistical Machine Translation (WMT)},
url = {http://www.cs.jhu.edu/~anni/papers/irvineCCB_WMT13.pdf}
}

Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals
Ann Irvine and Chris Callison-Burch
Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) – 2013

[abstract] [bib]

Abstract

Prior research into learning translations from monolingual texts has treated the task as an unsupervised learning problem. Although many techniques take advantage of a seed bilingual lexicon, this work is the first to use that data for supervised learning to combine a diverse set of monolingual signals into a single discriminative model. Even in a low resource machine translation setting, where induced translations have the potential to improve performance substantially, it is reasonable to assume access to some amount of data to perform this kind of optimization. We report bilingual lexicon induction accuracies that are on average nearly 50% higher than an unsupervised baseline. Large gains in accuracy hold for all 22 languages (low and high resource) that we investigate.
@InProceedings{irvineCallisonBurch13,
author = {Irvine, Ann and Callison-Burch, Chris},
title = {Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals},
booktitle = {Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL)},
url = {http://www.cs.jhu.edu/~anni/papers/irvineCallisonBuch-NAACL2013.pdf}
}

Statistical Machine Translation in Low Resource Settings
Ann Irvine
Proceedings of the NAACL Student Research Workshop – 2013

[bib]

@InProceedings{irvineNAACLSRW13,
author = {Irvine, Ann},
title = {Statistical Machine Translation in Low Resource Settings},
booktitle = {Proceedings of the NAACL Student Research Workshop}
}

Quantifying the Value of Pronunciation Lexicons for Keyword Search in Low Resource Languages
Guoguo Chen, Sanjeev Khudanpur, Daniel Povey, Jan Trmal, David Yarowsky and Oguz Yilmaz
2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) – 2013

Tags: Speech Recognition, Keyword Search, Information Retrieval, Morphology, Speech Synthesis  |  [bib]

@inproceedings{chen2013quantifying,
author = {Chen, Guoguo and Khudanpur, Sanjeev and Povey, Daniel and Jan Trmal and Yarowsky, David and Yilmaz, Oguz},
title = {Quantifying the Value of Pronunciation Lexicons for Keyword Search in Low Resource Languages},
booktitle = {2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages = {8560--8564},
url = {http://old-site.clsp.jhu.edu/~guoguo/papers/chen2013quantifying.pdf}
}

The (Un)faithful Machine Translator
Ruth Jones and Ann Irvine
ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH) – 2013

[bib]

@inProceedings{jonesIrvine,
author = {Ruth Jones and Irvine, Ann},
title = {The (Un)faithful Machine Translator},
booktitle = {ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)},
url = {http://www.cs.jhu.edu/~anni/papers/jonesIrvineTranslation.pdf}
}

A framework for (under)specifying dependency syntax without overloading annotators
Nathan Schneider, Brendan O'Connor, Naomi P. Saphra, David Bamman, Manaal Faruqui, Noah Smith, Chris Dyer and Jason Baldridge
CoRR – 2013

[bib]

@article{DBLP:journals/corr/SchneiderOSBFSDB13,
author = {Nathan Schneider and Brendan O'Connor and Saphra, Naomi and David Bamman and Manaal Faruqui and Noah Smith and Chris Dyer and Jason Baldridge},
title = {A framework for (under)specifying dependency syntax without overloading annotators}
}

Using Proxies for OOV Keywords in the Keyword Search Task
Guoguo Chen, Oguz Yilmaz, Jan Trmal, Daniel Povey and Sanjeev Khudanpur
Proceedings of ASRU 2013 – 2013

Tags: Speech Recognition, Keyword Search, OOV Keywords, Proxy Keywords, Low Resource LVCSR  |  [bib]

@inproceedings{chen2013using,
author = {Chen, Guoguo and Yilmaz, Oguz and Jan Trmal and Povey, Daniel and Khudanpur, Sanjeev},
title = {Using Proxies for OOV Keywords in the Keyword Search Task},
booktitle = {Proceedings of ASRU 2013},
url = {http://old-site.clsp.jhu.edu/~guoguo/papers/chen2013using.pdf}
}

Fixed-Dimensional Acoustic Embeddings of Variable-Length Segments in Low-Resource Settings
Keith Levin, Katharine Henry, Aren Jansen and Karen Livescu
ASRU – 2013

[bib]

@inproceedings{Levin2013,
author = {Levin, Keith and Henry, Katharine and Jansen, Aren and Karen Livescu},
title = {Fixed-Dimensional Acoustic Embeddings of Variable-Length Segments in Low-Resource Settings},
booktitle = {ASRU}
}


2012

A Flexible Solver for Finite Arithmetic Circuits
Nathaniel Filardo and Jason Eisner
Technical Communications of the 28th International Conference on Logic Programming, ICLP 2012 – 2012

[abstract] [bib]

Abstract

Arithmetic circuits arise in the context of weighted logic programming languages, such as Datalog with aggregation, or Dyna. A weighted logic program defines a generalized arithmetic circuit—the weighted version of a proof forest, with nodes having arbitrary rather than boolean values. In this paper, we focus on finite circuits. We present a flexible algorithm for efficiently querying node values as they change under updates to the circuit's inputs. Unlike traditional algorithms, ours is agnostic about which nodes are tabled (materialized), and can vary smoothly between the traditional strategies of forward and backward chaining. Our algorithm is designed to admit future generalizations, including cyclic and infinite circuits and propagation of delta updates.
@inproceedings{filardo-eisner-2012-iclp,
author = {Filardo, Nathaniel and Eisner, Jason},
title = {A Flexible Solver for Finite Arithmetic Circuits},
booktitle = {Technical Communications of the 28th International Conference on Logic Programming, ICLP 2012},
url = {http://cs.jhu.edu/~jason/papers/#iclp12}
}

MAP Estimation of Whole-Word Acoustic Models with Dictionary Priors
Keith Kintzley, Aren Jansen and Hynek Hermansky
Proc. of INTERSPEECH – 2012

[abstract] [bib]

Abstract

The intrinsic advantages of whole-word acoustic modeling are offset by the problem of data sparsity. To address this, we present several parametric approaches to estimating intra-word phonetic timing models under the assumption that relative timing is independent of word duration. We show evidence that the timing of phonetic events is well described by the Gaussian distribution. We explore the construction of models in the absence of keyword examples (dictionary-based), when keyword examples are abundant (Gaussian mixture models), and also present a Bayesian approach which unifies the two. Applying these techniques in a point process model keyword spotting framework, we demonstrate a 55% relative improvement in performance for models constructed from few examples.
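
The duration-independence assumption above can be written compactly (our notation): if a keyword instance has total duration D and its i-th phonetic event occurs at time t_i, the relative timing is modeled as Gaussian, independently of D:

  t_i / D \;\sim\; \mathcal{N}(\mu_i, \sigma_i^2)
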
@InProceedings{kintzley-jansen-hermansky:is2012a,
author = {Kintzley, Keith and Jansen, Aren and Hermansky, Hynek},
title = {MAP Estimation of Whole-Word Acoustic Models with Dictionary Priors},
url = {http://old-site.clsp.jhu.edu/~ajansen/papers/IS2012c.pdf}
}

Inverting the Point Process Model for Fast Phonetic Keyword Search
Keith Kintzley, Aren Jansen, Kenneth Church and Hynek Hermansky
Proc. of INTERSPEECH – 2012

[abstract] [bib]

Abstract

Normally, we represent speech as a long sequence of frames and model the keyword with a relatively small set of parameters, commonly with a hidden Markov model (HMM). However, since the input speech is much longer than the keyword, suppose instead that we represent the speech as a relatively sparse set of impulses (roughly one per phoneme) and model the keyword as a filter-bank where each filter's impulse response relates to the likelihood of a phone at a given position within a word. Evaluating keyword detections can then be seen as a convolution of an impulse train with an array of filters. This view enables huge speedups; runtime no longer depends on the frame rate and is instead linear in the number of events (impulses). We apply this intuition to redesign the runtime engine behind the point process model for keyword spotting. We demonstrate impressive real-time speedups (500,000x faster than real-time) with minimal loss in search accuracy.
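
The convolutional view described above can be sketched as follows (our notation): let E_p be the set of impulse times for phone p and f_p the filter giving the likelihood of phone p at each position within the keyword; a detection function is then

  s(t) \;=\; \sum_{p} \sum_{t_i \in E_p} f_p(t - t_i),

so the cost of scoring detections grows with the number of events rather than the number of frames.
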
@InProceedings{kintzley-jansen-church-hermansky:is2012b,
author = {Kintzley, Keith and Jansen, Aren and Church, Kenneth and Hermansky, Hynek},
title = {Inverting the Point Process Model for Fast Phonetic Keyword Search},
address = {Portland, Oregon, USA},
publisher = {International Speech Communication Association},
url = {http://old-site.clsp.jhu.edu/~ajansen/papers/IS2012d.pdf}
}

Name Phylogeny: A Generative Model of String Variation
Nicholas Andrews, Jason Eisner and Mark Dredze
Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) – 2012

[abstract] [bib]

Abstract

Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
@inproceedings{andrews-et-al-2012-emnlp,
author = {Andrews, Nicholas and Eisner, Jason and Dredze, Mark},
title = {Name Phylogeny: A Generative Model of String Variation},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
pages = {344--355},
url = {http://cs.jhu.edu/~jason/papers/#emnlp12}
}

Findings of the 2012 Workshop on Statistical Machine Translation
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut and Lucia Specia
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012

[abstract] [bib]

Abstract

This paper presents the results of the WMT12 shared tasks, which included a translation task, a task for machine translation evaluation metrics, and a task for run-time estimation of machine translation quality. We conducted a large-scale manual evaluation of 103 machine translation systems submitted by 34 teams. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 12 evaluation metrics. We introduced a new quality estimation task this year, and evaluated submissions from 11 teams.
@inproceedings{callisonburch-EtAl:2012:WMT,
author = {Callison-Burch, Chris and Philipp Koehn and Christof Monz and Post, Matt and Radu Soricut and Lucia Specia},
title = {Findings of the 2012 Workshop on Statistical Machine Translation},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
address = {Montr\'eal, Canada},
publisher = {Association for Computational Linguistics},
pages = {10--51},
url = {http://cs.jhu.edu/~ccb/publications/findings-of-the-wmt12-shared-tasks.pdf}
}

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
Matt Post, Chris Callison-Burch and Miles Osborne
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012

[abstract] [bib]

Abstract

Recent work has established the efficacy of Amazon's Mechanical Turk for constructing parallel corpora for machine translation research. We apply this to building a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation. We conduct a variety of baseline experiments and analysis, and release the data to the community.
@inproceedings{post-callisonburch-osborne:2012:WMT,
author = {Post, Matt and Callison-Burch, Chris and Miles Osborne},
title = {Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
address = {Montr\'eal, Canada},
publisher = {Association for Computational Linguistics},
pages = {401--409},
url = {http://www.aclweb.org/anthology/W12-3152}
}

Using Categorial Grammar to Label Translation Rules
Jonathan Weese, Chris Callison-Burch and Adam Lopez
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012

[abstract] [bib]

Abstract

Adding syntactic labels to synchronous context-free translation rules can improve performance, but labeling with phrase structure constituents, as in GHKM (Galley et al., 2004), excludes potentially useful translation rules. SAMT (Zollmann and Venugopal, 2006) introduces heuristics to create new non-constituent labels, but these heuristics introduce many complex labels and tend to add rarely-applicable rules to the translation grammar. We introduce a labeling scheme based on categorial grammar, which allows syntactic labeling of many rules with a minimal, well-motivated label set. We show that our labeling scheme performs comparably to SAMT on an Urdu-English translation task, yet the label set is an order of magnitude smaller, and translation is twice as fast.
@inproceedings{weese-callisonburch-lopez:2012:WMT,
author = {Weese, Jonathan and Callison-Burch, Chris and Lopez, Adam},
title = {Using Categorial Grammar to Label Translation Rules},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
address = {Montr\'eal, Canada},
publisher = {Association for Computational Linguistics},
pages = {222--231},
url = {http://cs.jhu.edu/~ccb/publications/using-categorial-grammar-to-label-translation-rules.pdf}
}

Joshua 4.0: Packing, PRO, and Paraphrases
Juri Ganitkevitch, Yuan Cao, Jonathan Weese, Matt Post and Chris Callison-Burch
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012

[abstract] [bib]

Abstract

We present Joshua 4.0, the newest version of our open-source decoder for parsing-based statistical machine translation. The main contributions in this release are the introduction of a compact grammar representation based on packed tries, and the integration of our implementation of pairwise ranking optimization, J-PRO. We further present the extension of the Thrax SCFG grammar extractor to pivot-based extraction of syntactically informed sentential paraphrases.
@inproceedings{ganitkevitch-EtAl:2012:WMT,
author = {Ganitkevitch, Juri and Cao, Yuan and Weese, Jonathan and Post, Matt and Callison-Burch, Chris},
title = {Joshua 4.0: Packing, PRO, and Paraphrases},
booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
address = {Montr\'eal, Canada},
publisher = {Association for Computational Linguistics},
pages = {283--291},
url = {http://cs.jhu.edu/~ccb/publications/joshua-4.0.pdf}
}

Monolingual Distributional Similarity for Text-to-Text Generation
Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch
*SEM First Joint Conference on Lexical and Computational Semantics – 2012

[abstract] [bib]

Abstract

Previous work on paraphrase extraction and application has relied on either parallel datasets, or on distributional similarity metrics over large text corpora. Our approach combines these two orthogonal sources of information and directly integrates them into our paraphrasing system’s log-linear model. We compare different distributional similarity feature-sets and show significant improvements in grammaticality and meaning retention on the example text-to-text generation task of sentence compression, achieving state-of-the-art quality.
@inproceedings{Ganitkevitch-etal:2012:StarSEM,
author = {Ganitkevitch, Juri and Van Durme, Benjamin and Callison-Burch, Chris},
title = {Monolingual Distributional Similarity for Text-to-Text Generation},
booktitle = {*SEM First Joint Conference on Lexical and Computational Semantics},
address = {Montreal},
publisher = {Association for Computational Linguistics},
url = {http://cs.jhu.edu/~ccb/publications/monolingual-distributional-similarity-for-text-to-text-generation.pdf}
}

Machine Translation of Arabic Dialects
Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar Zaidan and Chris Callison-Burch
The 2012 Conference of the North American Chapter of the Association for Computational Linguistics – 2012

[abstract] [bib]

Abstract

Arabic dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialect sentences are selected from a large corpus of Arabic web text, and translated using Mechanical Turk. We use this data to build Dialect Arabic MT systems. Small amounts of dialect data have a dramatic impact on the quality of translation. When translating Egyptian and Levantine test sets, our Dialect Arabic MT system performs 5.8 and 6.8 BLEU points higher than a Modern Standard Arabic MT system trained on a 150 million word Arabic-English parallel corpus -- over 100 times the amount of data as our dialect corpora.
@inproceedings{Zbib-etal:2012:NAACL,
author = {Rabih Zbib and Erika Malchiodi and Jacob Devlin and David Stallard and Spyros Matsoukas and Richard Schwartz and John Makhoul and Zaidan, Omar and Callison-Burch, Chris},
title = {Machine Translation of Arabic Dialects},
booktitle = {The 2012 Conference of the North American Chapter of the Association for Computational Linguistics},
address = {Montreal},
publisher = {Association for Computational Linguistics},
url = {http://cs.jhu.edu/~ccb/publications/machine-translation-of-arabic-dialects.pdf}
}

Training and Evaluating a Statistical Part of Speech Tagger for Natural Language Applications using Kepler Workflows
Doug Briesch, Reginald Hobbs, Claire Jaja, Brian Kjersten and Clare Voss
Procedia Computer Science – 2012

[abstract] [bib]

Abstract

A core technology of natural language processing (NLP) incorporated into many text processing applications is a part of speech (POS) tagger, a software component that labels words in text with syntactic tags such as noun, verb, adjective, etc. These tags may then be used within more complex tasks such as parsing, question answering, and machine translation (MT). In this paper we describe the phases of our work training and evaluating statistical POS taggers on Arabic texts and their English translations using Kepler workflows. While the original objectives for encapsulating our research code within Kepler workflows were driven by software engineering needs to document and verify the reusability of our software, our research benefitted as well: the ease of rapid retraining and testing enabled our researchers to detect reporting discrepancies, document their source, and independently validate the correct results.
@article{Briesch20121588,
author = {Doug Briesch and Reginald Hobbs and Claire Jaja and Kjersten, Brian and Clare Voss},
title = {Training and Evaluating a Statistical Part of Speech Tagger for Natural Language Applications using Kepler Workflows},
pages = {1588--1594},
url = {http://www.sciencedirect.com/science/article/pii/S1877050912002955}
}

Annotated Gigaword
Courtney Napoles, Matt Gormley and Benjamin Van Durme
AKBC-WEKEX Workshop at NAACL 2012 – 2012

[bib]

@inproceedings{napoles-EtAl:2012:Agiga,
author = {Napoles, Courtney and Gormley, Matt and Van Durme, Benjamin},
title = {Annotated Gigaword},
booktitle = {AKBC-WEKEX Workshop at NAACL 2012}
}

Cost-Sensitive Dynamic Feature Selection
He He, Hal Daumé III and Jason Eisner
ICML Workshop on Inferning: Interactions between Inference and Learning – 2012

[abstract] [bib]

Abstract

We present an instance-specific test-time dynamic feature selection algorithm. Our algorithm sequentially chooses features given previously selected features and their values. It stops the selection process to make a prediction according to a user-specified accuracy-cost trade-off. We cast the sequential decision-making problem as a Markov Decision Process and apply imitation learning techniques. We address the problem of learning and inference jointly in a simple multiclass classification setting. Experimental results on UCI datasets show that our approach achieves the same or higher accuracy than static feature selection methods while using only a small fraction of the features.
@inproceedings{he-et-al-2012-icmlw,
author = {He He and Hal Daumé III and Eisner, Jason},
title = {Cost-Sensitive Dynamic Feature Selection},
booktitle = {ICML Workshop on Inferning: Interactions between Inference and Learning},
url = {http://cs.jhu.edu/~jason/papers/#icmlw12-dynfeat}
}

Fast and Accurate Prediction via Evidence-Specific MRF Structure
Veselin Stoyanov and Jason Eisner
ICML Workshop on Inferning: Interactions between Inference and Learning – 2012

[abstract] [bib]

Abstract

We are interested in speeding up approximate inference in Markov Random Fields (MRFs). We present a new method that uses gates: binary random variables that determine which factors of the MRF to use. Which gates are open depends on the observed evidence; when many gates are closed, the MRF takes on a sparser and faster structure that omits "unnecessary" factors. We train parameters that control the gates, jointly with the ordinary MRF parameters, in order to locally minimize an objective that combines loss and runtime.
@inproceedings{stoyanov-eisner-2012-icmlw,
author = {Stoyanov, Veselin and Eisner, Jason},
title = {Fast and Accurate Prediction via Evidence-Specific MRF Structure},
booktitle = {ICML Workshop on Inferning: Interactions between Inference and Learning},
url = {http://cs.jhu.edu/~jason/papers/#icmlw12-gates}
}

Implicitly Intersecting Weighted Automata using Dual Decomposition
Michael Paul and Jason Eisner
Proceedings of NAACL-HLT – 2012

[abstract] [bib]

Abstract

We propose an algorithm to find the best path through an intersection of arbitrarily many weighted automata, without actually performing the intersection. The algorithm is based on dual decomposition: the automata attempt to agree on a string by communicating about features of the string. We demonstrate the algorithm on the Steiner consensus string problem, both on synthetic data and on consensus decoding for speech recognition. This involves implicitly intersecting up to 100 automata.
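
As a rough sketch of the setup above (our notation, and a simplification of the paper's actual formulation): with M weighted automata scoring a common string, dual decomposition lets each automaton pick its own string, coupled through Lagrange multipliers u_m on the string features \phi:

  \max_{x} \sum_{m=1}^{M} w_m(x)
  \;\;\rightsquigarrow\;\;
  \max_{x_1, \ldots, x_M} \sum_{m=1}^{M} \Big( w_m(x_m) + u_m^{\top} \phi(x_m) \Big),
  \qquad \sum_{m} u_m = 0,

with the multipliers updated (for example, by subgradient steps) until the automata agree on the features of their best strings.
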
@inproceedings{paul-eisner-2012-naacl,
author = {Paul, Michael and Eisner, Jason},
title = {Implicitly Intersecting Weighted Automata using Dual Decomposition},
booktitle = {Proceedings of NAACL-HLT},
pages = {232--242},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-dd}
}

Unsupervised Learning on an Approximate Corpus
Jason Smith and Jason Eisner
Proceedings of NAACL-HLT – 2012

[abstract] [bib]

Abstract

Unsupervised learning techniques can take advantage of large amounts of unannotated text, but the largest text corpus (the Web) is not easy to use in its full form. Instead, we have statistics about this corpus in the form of n-gram counts (Brants and Franz, 2006). While n-gram counts do not directly provide sentences, a distribution over sentences can be estimated from them in the same way that n-gram language models are estimated. We treat this distribution over sentences as an approximate corpus and show how unsupervised learning can be performed on such a corpus using variational inference. We compare hidden Markov model (HMM) training on exact and approximate corpora of various sizes, measuring speed and accuracy on unsupervised part-of-speech tagging.
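
For concreteness, the distribution over sentences mentioned above comes from the standard n-gram factorization (a textbook formula, not something specific to this paper):

  P(w_1 \ldots w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}),

with the conditional probabilities estimated from the n-gram counts; unsupervised learning then treats this distribution as an approximate corpus.
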
@inproceedings{smith-eisner-2012,
author = {Smith, Jason and Eisner, Jason},
title = {Unsupervised Learning on an Approximate Corpus},
booktitle = {Proceedings of NAACL-HLT},
pages = {131--141},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-ngram}
}

Minimum-Risk Training of Approximate CRF-Based NLP Systems
Veselin Stoyanov and Jason Eisner
Proceedings of NAACL-HLT – 2012

[abstract] [bib]

Abstract

Conditional Random Fields (CRFs) are a popular formalism for structured prediction in NLP. It is well known how to train CRFs with certain topologies that admit exact inference, such as linear-chain CRFs. Some NLP phenomena, however, suggest CRFs with more complex topologies. Should such models be used, considering that they make exact inference intractable? Stoyanov et al. (2011) recently argued for training parameters to minimize the task-specific loss of whatever approximate inference and decoding methods will be used at test time. We apply their method to three NLP problems, showing that (i) using more complex CRFs leads to improved performance, and that (ii) minimum-risk training learns more accurate models.
@inproceedings{stoyanov-eisner-2012-naacl,
author = {Stoyanov, Veselin and Eisner, Jason},
title = {Minimum-Risk Training of Approximate CRF-Based NLP Systems},
booktitle = {Proceedings of NAACL-HLT},
pages = {120--130},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-risk}
}

Learned Prioritization for Trading Off Accuracy and Speed
Jiarong Jiang, Adam Teichert, Hal Daumé III and Jason Eisner
ICML Workshop on Inferning: Interactions between Inference and Learning – 2012

[abstract] [bib]

Abstract

Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs for particular problems or datasets. We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is too large to explore naively. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.
@inproceedings{jiang-et-al-2012-icmlw,
author = {Jiarong Jiang and Teichert, Adam and Hal Daumé III and Eisner, Jason},
title = {Learned Prioritization for Trading Off Accuracy and Speed},
booktitle = {ICML Workshop on Inferning: Interactions between Inference and Learning},
url = {http://cs.jhu.edu/~jason/papers/#icmlw12-ldp}
}

Shared Components Topic Models
Matt Gormley, Mark Dredze, Benjamin Van Durme and Jason Eisner
Proceedings of NAACL-HLT – 2012

[abstract] [bib]

Abstract

With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.
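
The topic construction described above can be sketched as follows (our notation): with shared component distributions \beta_1, \ldots, \beta_C over the vocabulary and S_k the subset of components used by topic k, each topic is the normalized product

  \phi_k(w) \;=\; \frac{\prod_{c \in S_k} \beta_c(w)}{\sum_{w'} \prod_{c \in S_k} \beta_c(w')}
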
@inproceedings{gormley-et-al-2012-naacl,
author = {Gormley, Matt and Dredze, Mark and Van Durme, Benjamin and Eisner, Jason},
title = {Shared Components Topic Models},
booktitle = {Proceedings of NAACL-HLT},
pages = {783--792},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-sctm}
}

Space Efficiencies in Discourse Modeling via Conditional Random Sampling
Brian Kjersten and Benjamin Van Durme
2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies – 2012

[abstract] [bib]

Abstract

Recent exploratory efforts in discourse-level language modeling have relied heavily on calculating Pointwise Mutual Information (PMI), which involves significant computation when done over large collections. Prior work has required aggressive pruning or independence assumptions to compute scores on large collections. We show the method of Conditional Random Sampling, thus far an underutilized technique, to be a space-efficient means of representing the sufficient statistics in discourse that underlie recent PMI-based work. This is demonstrated in the context of inducing Schankian script-like structures over news articles.
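
For reference, the PMI statistic at the center of the work described above is the standard one:

  \mathrm{PMI}(x, y) \;=\; \log \frac{P(x, y)}{P(x)\, P(y)},

and Conditional Random Sampling stores compact sketches from which the co-occurrence counts behind P(x, y) can be estimated.
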
@inproceedings{KjerstenVanDurme2012,
author = {Kjersten, Brian and Van Durme, Benjamin},
title = {Space Efficiencies in Discourse Modeling via Conditional Random Sampling},
booktitle = {2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
pages = {513--517},
url = {http://www.aclweb.org/anthology/N/N12/N12-1056.pdf}
}

Stylometric Analysis of Scientific Articles
Shane Bergsma, Matt Post and David Yarowsky
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies – 2012

Tags: stylometry, syntax  |  [abstract] [bib]

Abstract

We present an approach to automatically recover hidden attributes of scientific articles, such as whether the author is a native English speaker, whether the author is a male or a female, and whether the paper was published in a conference or workshop proceedings. We train classifiers to predict these attributes in computational linguistics papers. The classifiers perform well in this challenging domain, identifying non-native writing with 95% accuracy (over a baseline of 67%). We show the benefits of using syntactic features in stylometry; syntax leads to significant improvements over bag-of-words models on all three tasks, achieving 10% to 25% relative error reduction. We give a detailed analysis of which words and syntax most predict a particular attribute, and we show a strong correlation between our predictions and a paper’s number of citations.
@inproceedings{bergsma-post-yarowsky:2012:NAACL-HLT,
author = {Bergsma, Shane and Post, Matt and Yarowsky, David},
title = {Stylometric Analysis of Scientific Articles},
booktitle = {Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
address = {Montr\'eal, Canada},
publisher = {Association for Computational Linguistics},
pages = {327--337},
url = {http://www.aclweb.org/anthology/N12-1033}
}

Judging Grammaticality with Count-Induced Tree Substitution Grammars
Francis Ferraro, Matt Post and Benjamin Van Durme
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP – 2012

[abstract] [bib]

Abstract

Prior work has shown the utility of syntactic tree fragments as features in judging the grammaticality of text. To date such fragments have been extracted from derivations of Bayesian-induced Tree Substitution Grammars (TSGs). Evaluating on discriminative coarse and fine grammaticality classification tasks, we show that a simple, deterministic, count-based approach to fragment identification performs on par with the more complicated grammars of Post (2011). This represents a significant reduction in complexity for those interested in the use of such fragments in the development of systems for the educational domain.
@inproceedings{ferraro-post-vandurme:2012:BEA,
author = {Ferraro, Francis and Post, Matt and Van Durme, Benjamin},
title = {Judging Grammaticality with Count-Induced Tree Substitution Grammars},
booktitle = {Proceedings of the Seventh Workshop on Building Educational Applications Using NLP},
address = {Montr\'eal, Canada},
publisher = {Association for Computational Linguistics},
pages = {116--121},
url = {http://www.aclweb.org/anthology/W12-2013}
}

Toward Tree Substitution Grammars with Latent Annotations
Francis Ferraro, Benjamin Van Durme and Matt Post
Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure – 2012

[abstract] [bib]

Abstract

We provide a model that extends the split-merge framework of Petrov et al. (2006) to jointly learn latent annotations and Tree Substitution Grammars (TSGs). We then conduct a variety of experiments with this model, first inducing grammars on a portion of the Penn Treebank and the Korean Treebank 2.0, and next experimenting with grammar refinement from a single nonterminal and from the Universal Part of Speech tagset. We present qualitative analysis showing promising signs across all experiments that our combined approach successfully provides for greater flexibility in grammar induction within the structured guidance provided by the treebank, leveraging the complementary natures of these two approaches.
@inproceedings{ferraro-vandurme-post:2012:WILS,
author = {Ferraro, Francis and Van Durme, Benjamin and Post, Matt},
title = {Toward Tree Substitution Grammars with Latent Annotations},
booktitle = {Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure},
address = {Montr\'eal, Canada},
publisher = {Association for Computational Linguistics},
pages = {23--30},
url = {http://www.aclweb.org/anthology/W12-1904}
}

Toward Statistical Machine Translation without Parallel Corpora
Alex Klementiev, Ann Irvine, Chris Callison-Burch and David Yarowsky
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL) – 2012

[abstract] [bib]

Abstract

We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate re-ordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed, and show that 82%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.
@InProceedings{klementiev-etal:2012:EACL,
author = {Alex Klementiev and Irvine, Ann and Callison-Burch, Chris and Yarowsky, David},
title = {Toward Statistical Machine Translation without Parallel Corpora},
booktitle = {Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
address = {Avignon, France},
publisher = {Association for Computational Linguistics},
url = {http://cs.jhu.edu/~ccb/publications/toward-statistical-machine-translation-without-parallel-corpora.pdf}
}

Learning Multivariate Distributions by Competitive Assembly of Marginals
Francisco Sanchez-Vega, Jason Eisner, Laurent Younes and Donald Geman
IEEE Transactions on Pattern Analysis and Machine Intelligence – 2012

[abstract] [bib]

Abstract

We present a new framework for learning high-dimensional multivariate probability distributions from estimated marginals. The approach is motivated by compositional models and Bayesian networks, and designed to adapt to small sample sizes. We start with a large, overlapping set of elementary statistical building blocks, or "primitives," which are low-dimensional marginal distributions learned from data. Each variable may appear in many primitives. Subsets of primitives are combined in a lego-like fashion to construct a probabilistic graphical model; only a small fraction of the primitives will participate in any valid construction. Since primitives can be precomputed, parameter estimation and structure search are separated. Model complexity is controlled by strong biases; we adapt the primitives to the amount of training data and impose rules which restrict the merging of them into allowable compositions. The likelihood of the data decomposes into a sum of local gains, one for each primitive in the final structure. We focus on a specific subclass of networks which are binary forests. Structure optimization corresponds to an integer linear program and the maximizing composition can be computed for reasonably large numbers of variables. Performance is evaluated using both synthetic data and real datasets from natural language processing and computational biology.
@article{sanchezvega-et-al-2012,
author = {Francisco Sanchez-Vega and Eisner, Jason and Laurent Younes and Donald Geman},
title = {Learning Multivariate Distributions by Competitive Assembly of Marginals},
url = {http://cs.jhu.edu/~jason/papers/#pami12}
}

Confidence-Weighted Linear Classification for Text Categorization
Koby Crammer, Mark Dredze and Fernando Pereira
2012

[abstract] [bib]

Abstract

Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks show that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.
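
A sketch of the update described above (the standard confidence-weighted formulation; notation ours): the weights are distributed as w ~ N(\mu, \Sigma), and after observing an example (x, y) with y \in \{-1, +1\}, the distribution moves as little as possible, in KL divergence, while classifying the example correctly with probability at least \eta:

  (\mu_{t+1}, \Sigma_{t+1}) \;=\; \arg\min_{\mu, \Sigma}\ \mathrm{KL}\big(\mathcal{N}(\mu, \Sigma) \,\big\|\, \mathcal{N}(\mu_t, \Sigma_t)\big)
  \quad \text{s.t.} \quad
  \Pr_{w \sim \mathcal{N}(\mu, \Sigma)}\big[\, y\,(w \cdot x) \ge 0 \,\big] \;\ge\; \eta
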
@article{Pereira:2011fk,
author = {Koby Crammer and Dredze, Mark and Fernando Pereira},
title = {Confidence-Weighted Linear Classification for Text Categorization}
}

Entity Clustering Across Languages
Spence Green, Nicholas Andrews, Matt Gormley, Mark Dredze and Christopher Manning
NAACL – 2012

[bib]

@inproceedings{Green2012,
author = {Spence Green and Andrews, Nicholas and Gormley, Matt and Dredze, Mark and Christopher Manning},
title = {Entity Clustering Across Languages},
booktitle = {NAACL}
}

New H-Infinity Bounds for the Recursive Least Squares Algorithm Exploiting Input Structure
Koby Crammer, Alex Kulesza and Mark Dredze
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) – 2012

[bib]

@inproceedings{Crammer:2012fk,
author = {Koby Crammer and Alex Kulesza and Dredze, Mark},
title = {New H-Infinity Bounds for the Recursive Least Squares Algorithm Exploiting Input Structure},
booktitle = {International Conference on Acoustics, Speech, and Signal Processing (ICASSP)}
}

Use of Modality and Negation in Semantically-Informed Syntactic MT
Kathryn Baker, Bonnie Dorr, Michael Bloodgood, Chris Callison-Burch, Nathaniel Filardo, Christine Piatko, Lori Levin and Scott Miller
Computational Linguistics – 2012

[abstract] [bib]

Abstract

This article describes the resource- and system-building efforts of an eight-week JHU Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically-Informed Machine Translation (SIMT). We describe a new modality/negation (MN) annotation scheme, a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation) and a holder (an experiencer of modality). We describe how our MN lexicon was produced semi-automatically and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set. We apply our MN annotation scheme to statistical machine translation using a syntactic framework that supports the inclusion of semantic annotations. Syntactic tags enriched with semantic annotations are assigned to parse trees in the target-language training texts through a process of tree grafting. While the focus of our work is modality and negation, the tree grafting procedure is general and supports other types of semantic information. We exploit this capability by including named entities, produced by a pre-existing tagger, in addition to the MN elements produced by the taggers described in this paper. The resulting system significantly outperformed a linguistically naïve baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English test set. This finding supports the hypothesis that both syntactic and semantic information can improve translation quality.
@article{baker-etal:2012:CL,
author = {Kathryn Baker and Bonnie Dorr and Michael Bloodgood and Callison-Burch, Chris and Filardo, Nathaniel and Christine Piatko and Lori Levin and Scott Miller},
title = {Use of Modality and Negation in Semantically-Informed Syntactic MT},
url = {http://cs.jhu.edu/~ccb/publications/modality-and-negation-in-semantically-informed-syntactic-mt.pdf}
}

Processing Informal, Romanized Pakistani Text Messages
Ann Irvine, Jonathan Weese and Chris Callison-Burch
Proceedings of the NAACL Workshop on Language in Social Media – 2012

[abstract] [bib]

Abstract

Regardless of language, the standard character set for text messages (SMS) and many other social media platforms is the Roman alphabet. There are romanization conventions for some character sets, but they are used inconsistently in informal text, such as SMS. In this work, we convert informal, romanized Urdu messages into the native Arabic script and normalize non-standard SMS language. Doing so prepares the messages for existing downstream processing tools, such as machine translation, which are typically trained on well-formed, native script text. Our model combines information at the word and character levels, allowing it to handle out-of-vocabulary items. Compared with a baseline deterministic approach, our system reduces both word and character error rate by over 50%.
@inproceedings{IrvineWeeseCallisonburchSMS12,
author = {Irvine, Ann and Weese, Jonathan and Callison-Burch, Chris},
title = {Processing Informal, Romanized Pakistani Text Messages},
booktitle = {Proceedings of the NAACL Workshop on Language in Social Media},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
url = {http://www.cs.jhu.edu/~anni/papers/urduSMS/urduSMS.pdf}
}
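
The word- and character-level combination described above can be illustrated with a minimal Python sketch, assuming a small word-level lookup table with a character-level transliteration fallback for out-of-vocabulary tokens; the table entries, the character map, and all names here are hypothetical stand-ins, not the authors' actual model:

    # Sketch: deromanize SMS tokens with word-level evidence first,
    # falling back to character-level mapping for OOV items.
    WORD_TABLE = {"salam": "سلام", "kya": "کیا"}    # hypothetical entries
    CHAR_MAP = {"s": "س", "a": "ا", "l": "ل", "m": "م",
                "k": "ک", "y": "ی"}                  # hypothetical, lossy

    def deromanize(token: str) -> str:
        if token in WORD_TABLE:                      # word-level model
            return WORD_TABLE[token]
        # character-level fallback handles out-of-vocabulary spellings
        return "".join(CHAR_MAP.get(c, c) for c in token.lower())

    print(deromanize("salam"), deromanize("salaam"))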

Digitizing 18th-Century French Literature: Comparing transcription methods for a critical edition text
Ann Irvine, Laure Marcellesi and Afra Zomorodian
Proceedings of the NAACL Workshop on Computational Linguistics for Literature – 2012

[abstract] [bib]

Abstract

We compare four methods for transcribing early printed texts. Our comparison is through a case-study of digitizing an eighteenth-century French novel for a new critical edition: the 1784 Lettres taïtiennes by Joséphine de Monbart. We provide a detailed error analysis of transcription by optical character recognition (OCR), non-expert humans, and expert humans and weigh each technique based on accuracy, speed, cost and the need for scholarly overhead. Our findings are relevant to 18th-century French scholars as well as the entire community of scholars working to preserve, present, and revitalize interest in literature published before the digital age.
@inproceedings{IrvineMarcellesiZomorodianFrench12,
author = {Irvine, Ann and Laure Marcellesi and Afra Zomorodian},
title = {Digitizing 18th-Century French Literature: Comparing transcription methods for a critical edition text},
booktitle = {Proceedings of the NAACL Workshop on Computational Linguistics for Literature},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
url = {http://www.cs.jhu.edu/~anni/papers/IrvineMonbartNAACL.pdf}
}

Expectations of Word Sense in Parallel Corpora
Xuchen Yao, Benjamin Van Durme and Chris Callison-Burch
NAACL – 2012

[bib]

@inproceedings{Yao2012NAACL,
author = {Yao, Xuchen and Van Durme, Benjamin and Callison-Burch, Chris},
title = {Expectations of Word Sense in Parallel Corpora},
booktitle = {NAACL},
url = {http://cs.jhu.edu/~xuchen/paper/Yao2012NAACL.pdf}
}

Semantics-based Question Generation and Implementation
Xuchen Yao, Gosse Bouma and Zhaonian Zhang
Dialogue and Discourse, Special Issue on Question Generation – 2012

[bib]

@article{Yao2012DDqg,
author = {Yao, Xuchen and Gosse Bouma and Zhang, Zhaonian},
title = {Semantics-based Question Generation and Implementation},
pages = {11-42},
url = {http://cs.jhu.edu/~xuchen/paper/Yao2012DDqg.pdf}
}

Sample Selection for Large-scale MT Discriminative Training
Yuan Cao and Sanjeev Khudanpur
Proceedings of the Annual Conference of the Association for Machine Translation in the Americas (AMTA) – 2012

[bib]

@inproceedings{sampleselection,
author = {Yuan Cao and Khudanpur, Sanjeev},
title = {Sample Selection for Large-scale MT Discriminative Training},
address = {San Diego, US}
}

Automatic Measurement of Positive and Negative Voice Onset Time
Katharine Henry, Morgan Sonderegger and Joseph Keshet
Interspeech – 2012

[bib]

@inproceedings{VOT2012,
author = {Henry, Katharine and Morgan Sonderegger and Joseph Keshet},
title = {Automatic Measurement of Positive and Negative Voice Onset Time},
booktitle = {Interspeech}
}

Generating Exact Lattices in The WFST Framework
Daniel Povey, Mirko Hannemann, Gilles Boulianne, Lukáš Burget, Arnab Ghoshal, Miloš Janda, Martin Karafiát, Stefan Kombrink, Petr Motlíček, Yanmin Qian, Korbinian Riedhammer, Karel Veselý and Thang Vu
Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing – 2012

[bib]

@inproceedings{poveyexactlattice,
author = {Povey, Daniel and Mirko Hannemann and Gilles Boulianne and Lukáš Burget and Ghoshal, Arnab and Miloš Janda and Martin Karafiát and Stefan Kombrink and Petr Motlíček and Yanmin Qian and Korbinian Riedhammer and Karel Veselý and Thang Vu},
title = {Generating Exact Lattices in The WFST Framework},
booktitle = {Proceedings of 2012 IEEE International Conference on Acoustics, Speech and Signal Processing},
publisher = {IEEE Signal Processing Society},
pages = {4213--4216}
}


2011

Generating More Specific Questions
Xuchen Yao
AAAI Symposium on Question Generation – 2011

[bib]

@inproceedings{Yao2011QG,
author = {Yao, Xuchen},
title = {Generating More Specific Questions},
booktitle = {AAAI Symposium on Question Generation},
address = {Arlington, VA},
url = {http://cs.jhu.edu/~xuchen/paper/Yao2011QG.pdf}
}

NADA: A Robust System for Non-Referential Pronoun Detection
Shane Bergsma and David Yarowsky
Proc. DAARC – 2011

[bib]

@inproceedings{Bergsma:11,
author = {Bergsma, Shane and Yarowsky, David},
title = {NADA: A Robust System for Non-Referential Pronoun Detection},
booktitle = {Proc. DAARC},
address = {Faro, Portugal}
}

Arabic Optical Character Recognition (OCR) Evaluation in Order to Develop a Post-OCR Module
Brian Kjersten
2011

[abstract] [details] [bib]

Abstract

Optical character recognition (OCR) is the process of converting an image of a document into text. While progress in OCR research has enabled low error rates for English text in low-noise images, performance is still poor for noisy images and documents in other languages. We intend to create a post-OCR processing module for noisy Arabic documents which can correct OCR errors before passing the resulting Arabic text to a translation system. To this end, we are evaluating an Arabic-script OCR engine on documents with the same content but varying levels of image quality. We have found that OCR text accuracy can be improved with different stages of pre-OCR image processing: (1) filtering out low-contrast images to avoid hallucination of characters, (2) removing marks from images with cleanup software to prevent their misrecognition, and (3) zoning multi-column images with segmentation software to enable recognition of all zones. The specific errors observed in OCR will form the basis of training data for our post-OCR correction module.

Additional Information

Memorandum Report number ARL-MR-0798

@techreport{kjersten2011,
author = {Kjersten, Brian},
title = {Arabic Optical Character Recognition (OCR) Evaluation in Order to Develop a Post-OCR Module},
publisher = {Army Research Laboratory},
pages = {1 - 18},
url = {http://www.stormingmedia.us/56/5644/A564455.html}
}

Using Visual Information to Predict Lexical Preference
Shane Bergsma and Randy Goebel
Proc. RANLP – 2011

[bib]

@inproceedings{Bergsma:11,
author = {Bergsma, Shane and Randy Goebel},
title = {Using Visual Information to Predict Lexical Preference},
booktitle = {Proc. RANLP},
address = {Hissar, Bulgaria},
pages = {399--405}
}

Event Selection from Phone Posteriorgrams Using Matched Filters
Keith Kintzley, Aren Jansen and Hynek Hermansky
Proc. of INTERSPEECH – 2011

[abstract] [bib]

Abstract

In this paper we address the issue of how to select a minimal set of phonetic events from a phone posteriorgram while minimizing the loss of information. We derive phone posteriorgrams from two sources, Gaussian mixture models and sparse multilayer perceptrons, and apply phone-specific matched filters to the posteriorgrams to yield a smaller set of phonetic events. We introduce a mutual information based performance measure to compare phonetic event selection techniques and demonstrate that events extracted using matched filters can reduce input data while significantly improving performance of an event-based keyword spotting system.
@inproceedings{kintzley-jansen-hermansky:is2011,
author = {Kintzley, Keith and Jansen, Aren and Hermansky, Hynek},
title = {Event Selection from Phone Posteriorgrams Using Matched Filters},
address = {Florence, Italy},
pages = {1905--1908},
url = {http://old-site.clsp.jhu.edu/~ajansen/papers/IS2011c.pdf}
}
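
As a rough sketch of the matched-filter event selection described above (not the authors' exact setup), one can convolve each phone's posterior trajectory with a short filter and keep supra-threshold local maxima as events; the Gaussian filter shape, the threshold, and the array layout are all assumptions:

    import numpy as np

    def select_events(posteriorgram, filt, threshold=0.5):
        """posteriorgram: (num_frames, num_phones) array of posteriors.
        Returns (frame, phone, score) triples at supra-threshold peaks
        of each phone track after filtering."""
        events = []
        for p in range(posteriorgram.shape[1]):
            track = np.convolve(posteriorgram[:, p], filt, mode="same")
            for t in range(1, len(track) - 1):
                if track[t] > threshold and track[t-1] <= track[t] >= track[t+1]:
                    events.append((t, p, float(track[t])))
        return events

    # e.g. a 9-tap Gaussian-shaped filter, normalized to unit area
    filt = np.exp(-0.5 * np.linspace(-2, 2, 9) ** 2)
    filt /= filt.sum()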

Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation
Jason Riesa, Ann Irvine and Daniel Marcu
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP) – 2011

[abstract] [bib]

Abstract

We present an accurate word alignment algorithm that heavily exploits source and target-language syntax. Using a discriminative framework and an efficient bottom-up search algorithm, we train a model of hundreds of thousands of syntactic features. Our new model (1) helps us to very accurately model syntactic transformations between languages; (2) is language-independent; and (3) with automatic feature extraction, assists system developers in obtaining good word-alignment performance off-the-shelf when tackling new language pairs. We analyze the impact of our features, describe inference under the model, and demonstrate significant alignment and translation quality improvements over already-powerful baselines trained on very large corpora. We observe translation quality improvements corresponding to 1.0 and 1.3 BLEU for Arabic-English and Chinese-English, respectively.
@inproceedings{riesa-irvine-marcu:2011:EMNLP,
author = {Jason Riesa and Irvine, Ann and Daniel Marcu},
title = {Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
address = {Edinburgh, Scotland, UK.},
publisher = {Association for Computational Linguistics},
pages = {497--507},
url = {http://dl.acm.org/citation.cfm?id=2145432.2145490}
}

Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles and Benjamin Van Durme
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011

[abstract] [bib]

Abstract

Previous work has shown that high quality phrasal paraphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems.
@inproceedings{ganitkevitch-EtAl:2011:EMNLP,
author = {Ganitkevitch, Juri and Callison-Burch, Chris and Napoles, Courtney and Van Durme, Benjamin},
title = {Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
address = {Edinburgh, Scotland, UK.},
publisher = {Association for Computational Linguistics},
pages = {1168-1179},
url = {http://www.aclweb.org/anthology/D11-1108}
}

Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor
Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez
Proceedings of the Sixth Workshop on Statistical Machine Translation – 2011

[abstract] [bib]

Abstract

We present progress on Joshua, an open source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats.
@inproceedings{weese-EtAl:2011:WMT,
author = {Weese, Jonathan and Ganitkevitch, Juri and Callison-Burch, Chris and Post, Matt and Lopez, Adam},
title = {Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {478--484},
url = {http://www.aclweb.org/anthology/W11-2160}
}

Findings of the 2011 Workshop on Statistical Machine Translation
Chris Callison-Burch, Philipp Koehn, Christof Monz and Omar Zaidan
Proceedings of the Sixth Workshop on Statistical Machine Translation – 2011

[abstract] [bib]

Abstract

This paper presents the results of the WMT11 shared tasks, which included a translation task, a system combination task, and a task for machine translation evaluation metrics. We conducted a large-scale manual evaluation of 148 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 21 evaluation metrics. This year featured a Haitian Creole to English task translating SMS messages sent to an emergency response service in the aftermath of the Haitian earthquake. We also conducted a pilot ‘tunable metrics’ task to test whether optimizing a fixed system to different metrics would result in perceptibly different translation quality.
@inproceedings{callisonburch-EtAl:2011:WMT,
author = {Callison-Burch, Chris and Philipp Koehn and Christof Monz and Zaidan, Omar},
title = {Findings of the 2011 Workshop on Statistical Machine Translation},
booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
address = {Edinburgh, Scotland},
publisher = {Association for Computational Linguistics},
pages = {22--64},
url = {http://www.aclweb.org/anthology/W11-2103}
}

You Are What You Tweet : Analyzing Twitter for Public Health
Michael Paul and Mark Dredze
5th International Conference on Weblogs and Social Media – 2011

[bib]

@inproceedings{citeulike:9834165,
author = {Paul, Michael and Dredze, Mark},
title = {You Are What You Tweet : Analyzing Twitter for Public Health},
booktitle = {5th International Conference on Weblogs and Social Media},
publisher = {AAAI Press},
pages = {265--272},
url = {http://www.cs.jhu.edu/~mpaul/files/2011.icwsm.twitter_health.pdf}
}

Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images
Shane Bergsma and Benjamin Van Durme
Proc. IJCAI – 2011

[bib]

@inproceedings{Bergsma:11,
author = {Bergsma, Shane and Van Durme, Benjamin},
title = {Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images},
booktitle = {Proc. IJCAI},
address = {Barcelona, Spain},
pages = {1764--1769}
}

Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity
Charley Chan, Chris Callison-Burch and Benjamin Van Durme
Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics – 2011

[abstract] [bib]

Abstract

This paper improves an existing bilingual paraphrase extraction technique using monolingual distributional similarity to rerank candidate paraphrases. Raw monolingual data provides a complementary and orthogonal source of information that lessens the commonly observed errors in bilingual pivot-based methods. Our experiments reveal that monolingual scoring of bilingually extracted paraphrases has a significantly stronger correlation with human judgments of grammaticality than the probabilities assigned by the bilingual pivoting method. The results also show that monolingual distributional similarity can serve as a threshold for high-precision paraphrase selection.
@inproceedings{chan-callisonburch-vandurme:2011:GEMS,
author = {Chan, Charley and Callison-Burch, Chris and Van Durme, Benjamin},
title = {Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity},
booktitle = {Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics},
address = {Edinburgh, UK},
publisher = {Association for Computational Linguistics},
pages = {33--42},
url = {http://www.aclweb.org/anthology/W11-2504}
}
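
A minimal sketch of the reranking idea, assuming sparse context-count vectors as the monolingual distributional signature; the vector format and function names are illustrative, not the paper's implementation:

    import math

    def cosine(u, v):
        """Cosine similarity between sparse context-count vectors (dicts)."""
        dot = sum(c * v[w] for w, c in u.items() if w in v)
        nu = math.sqrt(sum(c * c for c in u.values()))
        nv = math.sqrt(sum(c * c for c in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def rerank(phrase_vector, candidates):
        """candidates: (paraphrase, pivot_score, context_vector) triples.
        Reorders bilingually extracted paraphrases by their monolingual
        distributional similarity to the original phrase."""
        return sorted(candidates,
                      key=lambda cand: cosine(phrase_vector, cand[2]),
                      reverse=True)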

Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
Markus Dreyer and Jason Eisner
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) – 2011

[abstract] [bib]

Abstract

We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50-100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.
@inproceedings{dreyer-eisner-2011,
author = {Dreyer, Markus and Eisner, Jason},
title = {Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model},
booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
address = {Edinburgh},
pages = {616--627},
url = {http://cs.jhu.edu/~jason/papers/#emnlp11-morphcorpus}
}

WikiTopics: What is Popular on Wikipedia and Why
Byung Gyu Ahn, Benjamin Van Durme and Chris Callison-Burch
Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages – 2011

[abstract] [bib]

Abstract

We establish a novel task in the spirit of news summarization and topic detection and tracking (TDT): daily determination of the topics newly popular with Wikipedia readers. Central to this effort is a new public dataset consisting of the hourly page view statistics of all Wikipedia articles over the last three years. We give baseline results for the tasks of: discovering individual pages of interest, clustering these pages into coherent topics, and extracting the most relevant summarizing sentence for the reader. When compared to human judgements, our system shows the viability of this task, and opens the door to a range of exciting future work.
@inproceedings{ahn-vandurme-callisonburch:2011:SummarizationWorkshop,
author = {Ahn, Byung Gyu and Van Durme, Benjamin and Callison-Burch, Chris},
title = {WikiTopics: What is Popular on Wikipedia and Why},
booktitle = {Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {33--40},
url = {http://www.aclweb.org/anthology/W11-0505}
}

Evaluating Sentence Compression: Pitfalls and Suggested Remedies
Courtney Napoles, Benjamin Van Durme and Chris Callison-Burch
Proceedings of the Workshop on Monolingual Text-To-Text Generation – 2011

[abstract] [bib]

Abstract

This work surveys existing evaluation methodologies for the task of sentence compression, identifies their shortcomings, and proposes alternatives. In particular, we examine the problems of evaluating paraphrastic compression and comparing the output of different models. We demonstrate that compression rate is a strong predictor of compression quality and that perceived improvement over other models is often a side effect of producing longer output.
@inproceedings{napoles-vandurme-callisonburch:2011:T2TW-2011,
author = {Napoles, Courtney and Van Durme, Benjamin and Callison-Burch, Chris},
title = {Evaluating Sentence Compression: Pitfalls and Suggested Remedies},
booktitle = {Proceedings of the Workshop on Monolingual Text-To-Text Generation},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {91--97},
url = {http://www.aclweb.org/anthology/W11-1611}
}

Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion
Courtney Napoles, Chris Callison-Burch, Juri Ganitkevitch and Benjamin Van Durme
Proceedings of the Workshop on Monolingual Text-To-Text Generation – 2011

Tags: paraphrasing  |  [abstract] [bib]

Abstract

We present a substitution-only approach to sentence compression which “tightens” a sentence by reducing its character length. Replacing phrases with shorter paraphrases yields paraphrastic compressions as short as 60% of the original length. In support of this task, we introduce a novel technique for re-ranking paraphrases extracted from bilingual corpora. At high compression rates, paraphrastic compressions outperform a state-of-the-art deletion model in an oracle experiment. For further compression, deleting from oracle paraphrastic compressions preserves more meaning than deletion alone. In either setting, paraphrastic compression shows promise for surpassing deletion-only methods.
@inproceedings{napoles-EtAl:2011:T2TW-2011,
author = {Napoles, Courtney and Callison-Burch, Chris and Ganitkevitch, Juri and Van Durme, Benjamin},
title = {Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion},
booktitle = {Proceedings of the Workshop on Monolingual Text-To-Text Generation},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {84--90},
url = {http://www.aclweb.org/anthology/W11-1610}
}
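
The substitution-only "tightening" can be caricatured with a short sketch that greedily applies shorter paraphrases until a target character-length ratio is met; the toy paraphrase table and the greedy left-to-right strategy are assumptions for illustration, not the paper's grammar-based model:

    def tighten(tokens, paraphrases, target_ratio=0.6):
        """Substitution-only compression: replace words with shorter
        paraphrases until len(output) <= target_ratio * len(input)."""
        original_len = len(" ".join(tokens))
        tokens = list(tokens)
        for i, tok in enumerate(tokens):
            if len(" ".join(tokens)) <= target_ratio * original_len:
                break
            shorter = paraphrases.get(tok)
            if shorter and len(shorter) < len(tok):
                tokens[i] = shorter
        return " ".join(tokens)

    print(tighten("the vast majority of the population".split(),
                  {"vast": "big", "majority": "most", "population": "people"}))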

Paraphrase Fragment Extraction from Monolingual Comparable Corpora
Rui Wang and Chris Callison-Burch
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web – 2011

[abstract] [bib]

Abstract

We present a novel paraphrase fragment pair extraction method that uses a monolingual comparable corpus containing different articles about the same topics or events. The procedure consists of document pair extraction, sentence pair extraction, and fragment pair extraction. At each stage, we evaluate the intermediate results manually, and tune the later stages accordingly. With this minimally supervised approach, we achieve 62% accuracy on the paraphrase fragment pairs we collected and 67% on those extracted from the MSR corpus. The results look promising, given the minimal supervision of the approach, which can be further scaled up.
@inproceedings{wang-callisonburch:2011:BUCC,
author = {Rui Wang and Callison-Burch, Chris},
title = {Paraphrase Fragment Extraction from Monolingual Comparable Corpora},
booktitle = {Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {52--60},
url = {http://www.aclweb.org/anthology/W11-1208}
}

The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content
Omar Zaidan and Chris Callison-Burch
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011

[abstract] [bib]

Abstract

The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which have dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
@inproceedings{zaidan-callisonburch:2011:ACL-HLT2011,
author = {Zaidan, Omar and Callison-Burch, Chris},
title = {The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {37--41},
url = {http://www.aclweb.org/anthology/P11-2007}
}

Crowdsourcing Translation: Professional Quality from Non-Professionals
Omar Zaidan and Chris Callison-Burch
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011

[abstract] [bib]

Abstract

Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.
@inproceedings{zaidan-callisonburch:2011:ACL-HLT2011,
author = {Zaidan, Omar and Callison-Burch, Chris},
title = {Crowdsourcing Translation: Professional Quality from Non-Professionals},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {1220--1229},
url = {http://www.aclweb.org/anthology/P11-1122}
}

Incremental Syntactic Language Models for Phrase-based Translation
Lane Schwartz, Chris Callison-Burch, William Schuler and Stephen Wu
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011

[abstract] [bib]

Abstract

This paper describes a novel technique for incorporating syntactic knowledge into phrase-based machine translation through incremental syntactic parsing. Bottom-up and top-down parsers typically require a completed string as input. This requirement makes it difficult to incorporate them into phrase-based translation, which generates partial hypothesized translations from left-to-right. Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation. We give a formal definition of one such linear-time syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system. We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.
@inproceedings{schwartz-EtAl:2011:ACL-HLT20111,
author = {Lane Schwartz and Callison-Burch, Chris and William Schuler and Stephen Wu},
title = {Incremental Syntactic Language Models for Phrase-based Translation},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {620--631},
url = {http://www.aclweb.org/anthology/P11-1063}
}

Nonparametric Bayesian Word Sense Induction
Xuchen Yao and Benjamin Van Durme
Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing – 2011

[bib]

@inproceedings{Yao2011WSI,
author = {Yao, Xuchen and Van Durme, Benjamin},
title = {Nonparametric Bayesian Word Sense Induction},
booktitle = {Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing},
address = {Portland, Oregon},
publisher = {Association for Computational Linguistics},
pages = {10--14},
url = {http://cs.jhu.edu/~xuchen/paper/Yao2011WSI.pdf}
}

Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
Shane Bergsma, David Yarowsky and Kenneth Church
Proc. ACL – 2011

[bib]

@inproceedings{Bergsma:11,
author = {Bergsma, Shane and Yarowsky, David and Church, Kenneth},
title = {Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation},
booktitle = {Proc. ACL},
address = {Portland, Oregon},
pages = {1346--1355}
}

Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
Colin Cherry and Shane Bergsma
Proc. ACL – 2011

[bib]

@inproceedings{Cherry:11,
author = {Colin Cherry and Bergsma, Shane},
title = {Joint Training of Dependency Parsing Filters through Latent Support Vector Machines},
booktitle = {Proc. ACL},
address = {Portland, Oregon},
pages = {200--205}
}

Judging Grammaticality with Tree Substitution Grammar Derivations
Matt Post
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011

Tags: text classification, grammaticality, tree substitution grammar  |  [abstract] [bib]

Abstract

In this paper, we show that local features computed from the derivations of tree substitution grammars, such as the identity of particular fragments and a count of large and small fragments, are useful in binary grammatical classification tasks. Such features outperform n-gram features and various model scores by a wide margin. Although they fall short of the performance of the hand-crafted feature set of Charniak and Johnson (2005) developed for parse tree reranking, they do so with an order of magnitude fewer features. Furthermore, since the TSGs employed are learned in a Bayesian setting, the use of their derivations can be viewed as the automatic discovery of tree patterns useful for classification. On the BLLIP dataset, we achieve an accuracy of 89.9% in discriminating between grammatical text and samples from an n-gram language model.
@inproceedings{post:2011:ACL-HLT2011,
author = {Post, Matt},
title = {Judging Grammaticality with Tree Substitution Grammar Derivations},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {217--222},
url = {http://www.aclweb.org/anthology/P11-2038}
}

Variational Approximation of Long-Span Language Models for LVCSR
Anoop Deoras, Tomáš Mikolov, Stefan Kombrink, Martin Karafiát and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011

[abstract] [bib]

Abstract

Long-span language models that capture syntax and semantics are seldom used in the first pass of large vocabulary continuous speech recognition systems due to the prohibitive search-space of sentence-hypotheses. Instead, an N-best list of hypotheses is created using tractable n-gram models, and rescored using the long-span models. It is shown in this paper that computationally tractable variational approximations of the long-span models are a better choice than standard n-gram models for first pass decoding. They not only result in a better first pass output, but also produce a lattice with a lower oracle word error rate, and rescoring the N-best list from such lattices with the long-span models requires a smaller N to attain the same accuracy. Empirical results on the WSJ, MIT Lectures, NIST 2007 Meeting Recognition and NIST 2001 Conversational Telephone Recognition data sets are presented to support these claims.
@inproceedings{deoras2011variational,
author = {Deoras, Anoop and Tomáš Mikolov and Stefan Kombrink and Martin Karafiát and Khudanpur, Sanjeev},
title = {Variational Approximation of Long-Span Language Models for LVCSR},
booktitle = {Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing},
address = {Prague, Czech Republic},
pages = {5532-5535},
url = {http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5947612}
}
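
The core of the variational approximation can be sketched as follows: sample a large corpus from the long-span model, then estimate a tractable n-gram model from those samples for first-pass decoding. Smoothing and backoff are omitted here, and the function names are illustrative:

    from collections import Counter

    def variational_ngram(sampled_sentences, n=3):
        """Estimate maximum-likelihood n-gram probabilities from text
        sampled from a long-span LM, giving a tractable approximation
        usable in first-pass decoding."""
        grams, contexts = Counter(), Counter()
        for sent in sampled_sentences:
            toks = ["<s>"] * (n - 1) + sent + ["</s>"]
            for i in range(n - 1, len(toks)):
                gram = tuple(toks[i - n + 1:i + 1])
                grams[gram] += 1
                contexts[gram[:-1]] += 1
        return {g: c / contexts[g[:-1]] for g, c in grams.items()}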

Extensions of Recurrent Neural Network Language Model
Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011

[abstract] [bib]

Abstract

We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is the computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both training and testing phases. Next, we show the importance of using a backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. In the end, we discuss possibilities for reducing the number of parameters in the model. The resulting RNN model can thus be smaller, faster both during training and testing, and more accurate than the basic one.
@inproceedings{mikolov2011extensions,
author = {Tomas Mikolov and Stefan Kombrink and Lukas Burget and Jan Cernocky and Khudanpur, Sanjeev},
title = {Extensions of Recurrent Neural Network Language Model},
booktitle = {Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing},
address = {Prague, Czech Republic},
pages = {5528--5531},
url = {http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5947611}
}

Hill Climbing on Speech Lattices: A New Rescoring Framework
Ariya Rastrow, Markus Dreyer, Abhinav Sethy, Sanjeev Khudanpur, Bhuvana Ramabhadran and Mark Dredze
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011

[abstract] [bib]

Abstract

We describe a new approach for rescoring speech lattices - with long-span language models or wide-context acoustic models - that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique to start with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively search a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that to achieve the same reduction in error rate using a better estimated, higher order language model, our technique evaluates fewer utterance-length hypotheses than conventional N-best rescoring by two orders of magnitude. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.
@inproceedings{rastrow2011hill,
author = {Rastrow, Ariya and Markus Dreyer and Abhinav Sethy and Khudanpur, Sanjeev and Bhuvana Ramabhadran and Dredze, Mark},
title = {Hill Climbing on Speech Lattices: A New Rescoring Framework},
booktitle = {Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing},
address = {Prague, Czech Republic},
pages = {5032-5035},
url = {http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5947487}
}
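
A skeletal version of the hill-climbing loop, assuming a neighborhood() function that enumerates hypotheses within small edit distance in the lattice (constructed via finite-state operations in the paper) and a rescore() function implementing the long-span model; both names are placeholders:

    def hill_climb(start, neighborhood, rescore):
        """Move to the best-scoring hypothesis in the local neighborhood
        until no neighbor improves on the current score."""
        current, score = start, rescore(start)
        improved = True
        while improved:
            improved = False
            for hyp in neighborhood(current):
                s = rescore(hyp)
                if s > score:
                    current, score, improved = hyp, s, True
        return current, score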

Learning and Inference Algorithms for Partially-Observed Structured Switching Vector Autoregressive Models
Balakrishnan Varadarajan and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011

[abstract] [bib]

Abstract

We present learning and inference algorithms for a versatile class of partially observed vector autoregressive (VAR) models for multivariate time-series data. VAR models can capture a wide variety of temporal dynamics in a continuous multidimensional signal. Given a sequence of observations to be modeled by a VAR model, it is possible to estimate its parameters in closed form by solving a least squares problem. For high dimensional observations, the state space representation of a linear system is often invoked. One advantage of doing so is that we model the dynamics of a low dimensional hidden state instead of the observations, which results in robust estimation of the dynamical system parameters. The commonly used approach is to project the high dimensional observation to the low dimensional state space using a KL transform. In this article, we propose a novel approach to automatically discover the low dimensional dynamics in a switching VAR model by imposing discriminative structure on the model parameters. We demonstrate its efficacy via significant improvements in gesture recognition accuracy over a standard hidden Markov model, which does not take the state-conditional dynamics of the observations into account, on a bench-top suturing task.
@inproceedings{varadarajan2011learning,
author = {Varadarajan, Balakrishnan and Khudanpur, Sanjeev},
title = {Learning and Inference Algorithms for Partially-Observed Structured Switching Vector Autoregressive Models},
booktitle = {Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing},
address = {Prague, Czech Republic},
pages = {1281-1284},
url = {http://ieeexplore.ieee.org/Xplore/login.jsp?url=http\%3A\%2F\%2Fieeexplore.ieee.org\%2Fiel5\%2F5916934\%2F5946226\%2F05946645.pdf\%3Farnumber\%3D5946645&authDecision=-203}
}

Dirichlet Mixtures to Model Neural Network Posteriors in the HMM Framework
Balakrishnan Varadarajan, Sri Garimella and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011

[abstract] [bib]

Abstract

In this paper, we present a novel technique for modeling the posterior probability estimates obtained from a neural network directly in the HMM framework using Dirichlet Mixture Models (DMMs). Since posterior probability vectors lie on a probability simplex, their distribution can be modeled using DMMs. Being in an exponential family, the parameters of DMMs can be estimated in an efficient manner. Conventional approaches like TANDEM attempt to gaussianize the posteriors by suitable transforms and model them using Gaussian Mixture Models (GMMs). This requires a larger number of parameters, as it does not exploit the fact that the probability vectors lie on a simplex. We demonstrate through TIMIT phoneme recognition experiments that the proposed technique outperforms the conventional TANDEM approach.
@inproceedings{varadarajan2011dirichlet,
author = {Varadarajan, Balakrishnan and Garimella, Sri and Khudanpur, Sanjeev},
title = {Dirichlet Mixtures to Model Neural Network Posteriors in the HMM Framework},
booktitle = {Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing},
address = {Prague, Czech Republic},
pages = {5028-5031},
url = {http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5947486}
}

Johns Hopkins on the chip: microsystems and cognitive machines for sustainable, affordable, personalized medicine and health care (invited paper)
Andreas G Andreou
2011

[abstract] [bib]

Abstract

Semiconductor technology is contributing to the advancement of biotechnology, medicine and healthcare delivery in ways that were never envisioned - from chip micro-arrays, to scientific grade CMOS imagers and ion sensing arrays, to implantable prostheses. This exponential growth of sensory microsystems has led to an exponential growth of data. Cognitive machines, i.e. advanced computer architectures and algorithms, are carefully co-designed to extract knowledge from such health data, making rational decisions and recommendations for therapies. Nano, micro and macro robotics driven by sophisticated algorithms interface to the human body at different levels and scales, from nano-scale molecules to micron-scale cells to networks and all the way to the scale of organisms. The present era is one where semiconductor technology and the 'chip' are the foundation of sustainable and affordable personalised medicine and healthcare delivery.
@article{Andreou:2011wf,
author = {Andreou, Andreas},
title = {Johns Hopkins on the chip: microsystems and cognitive machines for sustainable, affordable, personalized medicine and health care (invited paper)},
pages = {s34--s37}
}

Language Models for Semantic Extraction and Filtering in Video Action Recognition
Evelyne Tzoukermann, Jan Neumann, Jana Kosecka, Cornelia Fermuller, Ian Perera, Francis Ferraro, Benjamin Sapp, Rizwan Chaudry and Gautam Singh
AAAI Workshop on Language-Action Tools for Cognitive Artificial Agents – 2011

[bib]

@inproceedings{tzoukermann-aaai-2011,
author = {Evelyne Tzoukermann and Jan Neumann and Jana Kosecka and Cornelia Fermuller and Ian Perera and Ferraro, Francis and Benjamin Sapp and Rizwan Chaudry and Gautam Singh},
title = {Language Models for Semantic Extraction and Filtering in Video Action Recognition},
booktitle = {AAAI Workshop on Language-Action Tools for Cognitive Artificial Agents},
url = {http://cs.jhu.edu/~ferraro/papers.html#tzoukermann-aaai-2011}
}

Recognizing Manipulation Actions in Arts and Crafts Shows using Domain Specific Visual and Textual Cues
Benjamin Sapp, Rizwan Chaudry, Xiaodong Yu, Gautam Singh, Ian Perera, Francis Ferraro, Evelyne Tzoukermann, Jana Kosecka and Jan Neumann
The 3rd International Workshop on Video Event Categorization, Tagging and Retrieval for Real-World Applications (VECTaR2011) – 2011

[bib]

@inproceedings{sapp-vectar-2011,
author = {Benjamin Sapp and Rizwan Chaudry and Xiaodong Yu and Gautam Singh and Ian Perera and Ferraro, Francis and Evelyne Tzoukermann and Jana Kosecka and Jan Neumann},
title = {Recognizing Manipulation Actions in Arts and Crafts Shows using Domain Specific Visual and Textual Cues},
booktitle = {The 3rd International Workshop on Video Event Categorization, Tagging and Retrieval for Real-World Applications (VECTaR2011)},
url = {http://cs.jhu.edu/~ferraro/papers.html#sapp-vectar-2011}
}

Beyond Amdahl's law: An objective function that links multiprocessor performance gains to delay and energy
Andrew S Cassidy and Andreas G Andreou
2011

[bib]

@article{Cassidy:2011ws,
author = {Andrew S Cassidy and Andreou, Andreas},
title = {Beyond Amdahl's law: An objective function that links multiprocessor performance gains to delay and energy}
}

Minimum Imputed Risk Unsupervised Discriminative Training for Machine Translation
Zhifei Li, Ziyuan Wang, Jason Eisner, Sanjeev Khudanpur and Brian Roark
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011

[abstract] [bib]

Abstract

Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target-language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
@inproceedings{li2011minimum,
author = {Zhifei Li and Wang, Ziyuan and Eisner, Jason and Khudanpur, Sanjeev and Brian Roark},
title = {Minimum Imputed Risk Unsupervised Discriminative Training for Machine Translation},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
address = {Edinburgh,UK},
url = {http://www.aclweb.org/anthology/D11-1085}
}

Unsupervised Arabic Dialect Adaptation with Self Training
Scott Novotney, Rich Schwartz and Sanjeev Khudanpur
Proceedings of the 12th Annual Conference of the International Speech Communication Association – 2011

[abstract] [bib]

Abstract

Useful training data for automatic speech recognition systems of colloquial speech is usually limited to expensive in-domain transcription. Broadcast news is an appealing source of easily available data to bootstrap into a new dialect. However, some languages, like Arabic, have deep linguistic differences resulting in poor cross domain performance. If no in-domain transcripts are available, but a large amount of in-domain audio is, self-training may be a suitable technique to bootstrap into the domain. In this work, we attempt to adapt Modern Standard Arabic (MSA) models to Levantine Arabic without any in-domain manual transcription. We contrast with varying amounts of in-domain transcription and show that 1) Self-training is effective with only one hour of in-domain transcripts. 2) Self-training is not a suitable solution to improve strong MSA models on Levantine. 3) Two metrics that quantify model bias predict self-training success. 4) Model bias explains the failure of self-training to adapt across strong domain mismatch.
@inproceedings{novotney2011unsupervised,
author = {Novotney, Scott and Rich Schwartz and Khudanpur, Sanjeev},
title = {Unsupervised Arabic Dialect Adaptation with Self Training},
booktitle = {Proceedings of the 12th Annual Conference of the International Speech Communication Association},
address = {Florence,Italy},
url = {http://www.clsp.jhu.edu/people/snovotne/papers/novotney_interspeech11.pdf}
}

Efficient Subsampling for Training Complex Language Models
Puyang Xu, Asela Gunawardana and Sanjeev Khudanpur
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011

[abstract] [bib]

Abstract

We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% to 5% of the standard complexity with no loss in performance.
@inproceedings{xu2011efficient,
author = {Xu, Puyang and Asela Gunawardana and Khudanpur, Sanjeev},
title = {Efficient Subsampling for Training Complex Language Models},
booktitle = {Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing},
address = {Edinburgh,UK},
url = {http://www.aclweb.org/anthology/D/D11/D11-1104.pdf}
}
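
The binarization-plus-subsampling trick reads roughly as below: each training position yields one positive event for the observed word and heavily subsampled negative events for the rest of the vocabulary, reweighted to stay unbiased. The event format and the 1/keep_prob importance weight are assumptions for illustration:

    import random

    def binarized_events(corpus, vocab, keep_prob=0.05):
        """corpus: iterable of (history, next_word) pairs.
        Yields (history, word, label, weight) binary training events,
        keeping each negative with probability keep_prob and weighting
        survivors by 1/keep_prob to compensate for discarded negatives."""
        for history, next_word in corpus:
            yield history, next_word, 1, 1.0
            for w in vocab:
                if w != next_word and random.random() < keep_prob:
                    yield history, w, 0, 1.0 / keep_prob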

Learning Speed-Accuracy Tradeoffs in Nondeterministic Inference Algorithms
Jason Eisner and Hal Daumé III
COST: NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning – 2011

[abstract] [bib]

Abstract

Could we explicitly train test-time inference heuristics to trade off accuracy and efficiency? We focus our discussion on agenda-based natural language parsing under a weighted context-free grammar. We frame the problem as reinforcement learning, discuss its special properties, and propose new strategies.
@inproceedings{eisner-daume-2011,
author = {Eisner, Jason and Hal Daumé III},
title = {Learning Speed-Accuracy Tradeoffs in Nondeterministic Inference Algorithms},
booktitle = {COST: NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning},
url = {http://cs.jhu.edu/~jason/papers/eisner+daume.nipsw11.pdf}
}

Human action categorization using ultrasound micro-Doppler signatures
Salvador Dura-Bernal, Guillaume Garreau, Charalambos Andreou, Andreas G Andreou, Julius Georgiou, Thomas Wennekers and Susan Denham
2011

[bib]

@article{DuraBernal:2011vs,
author = {Salvador Dura-Bernal and Guillaume Garreau and Charalambos Andreou and Andreou, Andreas and Julius Georgiou and Thomas Wennekers and Susan Denham},
title = {Human action categorization using ultrasound micro-Doppler signatures},
pages = {18--28}
}

A high-level analytical model for application specific CMP design exploration
Andrew S Cassidy, Kai Yu, Haolang Zhou and Andreas G Andreou
2011

[bib]

@article{Cassidy:2011uy,
author = {Andrew S Cassidy and Kai Yu and Zhou, Haolang and Andreou, Andreas},
title = {A high-level analytical model for application specific CMP design exploration}
}

Bio-Inspired Cognitive Analysis for Active and Passive Acoustic Sensors
Andreas G Andreou
2011

[bib]

@article{Andreou:2011vc,
author = {Andreou, Andreas},
title = {Bio-Inspired Cognitive Analysis for Active and Passive Acoustic Sensors}
}

Design of a one million neuron single FPGA neuromorphic system for real-time multimodal scene analysis
Andrew S Cassidy and Andreas G Andreou
45th Annual Conference on Information Sciences and Systems (CISS 2011) – 2011

[bib]

@inproceedings{Cassidy:2011vg,
author = {Andrew S Cassidy and Andreou, Andreas},
title = {Design of a one million neuron single FPGA neuromorphic system for real-time multimodal scene analysis},
booktitle = {45th Annual Conference on Information Sciences and Systems (CISS 2011)},
pages = {1--6}
}

A multimodal-corpus data collection system for cognitive acoustic scene analysis
Julius Georgiou, Philippe O Pouliquen, Andrew S Cassidy, Guillaume Garreau, Charalambos Andreou, Guillermo Stuarts, Cyrille d'Urbal, Susan Denham, Thomas Wennekers, Robert Mill, Istvan Winkler, Tamas Bohm, Orsolya Szalardy, Georg Klump, Simon Jones, Alexandra Bendixen and Andreas G Andreou
2011

[bib]

@article{Georgiou:2011uy,
author = {Julius Georgiou and Philippe O Pouliquen and Andrew S Cassidy and Guillaume Garreau and Charalambos Andreou and Guillermo Stuarts and Cyrille d'Urbal and Susan Denham and Thomas Wennekers and Robert Mill and Istvan Winkler and Tamas Bohm and Orsolya Szalardy and Georg Klump and Simon Jones and Alexandra Bendixen and Andreou, Andreas},
title = {A multimodal-corpus data collection system for cognitive acoustic scene analysis},
pages = {1--6}
}

Confusion Network Decoding for MT System Combination
Antti-Veikko Rosti, Eugene Matusov, Jason Smith, Necip Ayan, Jason Eisner, Damianos Karakos, Sanjeev Khudanpur, Gregor Leusch, Zhifei Li, Spyros Matsoukas, Hermann Ney, Richard Schwartz, B. Zhang and J. Zheng
Handbook of Natural Language Processing and Machine Translation – 2011

[bib]

@inbook{rosti2011confusion,
author = {Antti-Veikko Rosti and Eugene Matusov and Smith, Jason and Necip Ayan and Eisner, Jason and Karakos, Damianos and Khudanpur, Sanjeev and Gregor Leusch and Zhifei Li and Spyros Matsoukas and Hermann Ney and Richard Schwartz and B. Zhang and J. Zheng},
title = {Confusion Network Decoding for MT System Combination},
booktitle = {Handbook of Natural Language Processing and Machine Translation},
publisher = {Springer},
pages = {333-361},
url = {http://www.springer.com/computer/ai/book/978-1-4419-7712-0}
}

Forest Reranking for Machine Translation Using the Direct Translation Model
Zhi Li and Sanjeev Khudanpur
Handbook of Natural Language Processing and Machine Translation – 2011

[bib]

@inbook{li2011forest,
author = {Zhi Li and Khudanpur, Sanjeev},
title = {Forest Reranking for Machine Translation Using the Direct Translation Model},
booktitle = {Handbook of Natural Language Processing and Machine Translation},
publisher = {Springer},
pages = {226-236},
url = {http://www.springer.com/computer/ai/book/978-1-4419-7712-0}
}

Stepwise Optimal Subspace Pursuit for Improving Sparse Recovery
Balakrishnan Varadarajan, Sanjeev Khudanpur and Trac Tran
IEEE Signal Processing Letters – 2011

[abstract] [bib]

Abstract

We propose a new iterative algorithm to reconstruct an unknown sparse signal x from a set of projected measurements y = Φx . Unlike existing methods, which rely crucially on the near orthogonality of the sampling matrix Φ , our approach makes stepwise optimal updates even when the columns of Φ are not orthogonal. We invoke a block-wise matrix inversion formula to obtain a closed-form expression for the increase (reduction) in the L2-norm of the residue obtained by removing (adding) a single element from (to) the presumed support of x . We then use this expression to design a computationally tractable algorithm to search for the nonzero components of x . We show that compared to currently popular sparsity seeking matching pursuit algorithms, each step of the proposed algorithm is locally optimal with respect to the actual objective function. We demonstrate experimentally that the algorithm significantly outperforms conventional techniques in recovering sparse signals whose nonzero values have exponentially decaying magnitudes or are distributed N(0,1) .
@article{varadarajan2011stepwise,
author = {Varadarajan, Balakrishnan and Khudanpur, Sanjeev and Trac Tran},
title = {Stepwise Optimal Subspace Pursuit for Improving Sparse Recovery},
booktitle = {IEEE Signal Processing Letters},
pages = {27-30},
url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5639029&tag=1}
}
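
A brute-force rendition of the stepwise selection idea, recomputing a full least-squares fit per candidate column rather than using the paper's closed-form block-inversion updates; the function name is hypothetical and only numpy is assumed:

    import numpy as np

    def greedy_support_pursuit(Phi, y, k):
        """Grow the support of x one column at a time, picking at each
        step the column whose inclusion minimizes ||y - Phi_S x_S||_2."""
        support = []
        for _ in range(k):
            best_j, best_res = None, np.inf
            for j in range(Phi.shape[1]):
                if j in support:
                    continue
                S = support + [j]
                x_S = np.linalg.lstsq(Phi[:, S], y, rcond=None)[0]
                res = np.linalg.norm(y - Phi[:, S] @ x_S)
                if res < best_res:
                    best_j, best_res = j, res
            support.append(best_j)
        x = np.zeros(Phi.shape[1])
        x[support] = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
        return x, support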

Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure
Veselin Stoyanov, Alexander Ropson and Jason Eisner
Proceedings of AISTATS – 2011

[abstract] [bib]

Abstract

Graphical models are often used "inappropriately," with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic metadescent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.
@inproceedings{stoyanov-ropson-eisner-2011,
author = {Stoyanov, Veselin and Alexander Ropson and Eisner, Jason},
title = {Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure},
booktitle = {Proceedings of AISTATS},
url = {http://www.cs.jhu.edu/~jason/papers/stoyanov+al.aistats11.pdf}
}

Estimating Document Frequencies in a Speech Corpus
Damianos Karakos, Mark Dredze, Kenneth Church, Aren Jansen and Sanjeev Khudanpur
IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) – 2011

[abstract] [bib]

Abstract

Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
@inproceedings{Krakos:2011,
author = {Karakos, Damianos and Dredze, Mark and Church, Kenneth and Jansen, Aren and Khudanpur, Sanjeev},
title = {Estimating Document Frequencies in a Speech Corpus},
booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
url = {http://www.clsp.jhu.edu/~damianos/asru11_df_stats.pdf}
}
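
The lattice-based estimate of df(w) can be sketched in a few lines, assuming each spoken document is summarized by per-word posterior probabilities of occurring at least once (derivable from lattice expected counts); this data layout is an assumption for illustration:

    def expected_document_frequency(documents):
        """documents: list of dicts mapping word -> posterior probability
        that the word occurs at least once in that spoken document.
        Summing these posteriors estimates df(w) without committing to
        the 1-best ASR output."""
        df = {}
        for doc in documents:
            for word, p_occurs in doc.items():
                df[word] = df.get(word, 0.0) + p_occurs
        return df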

Adapting N-Gram Maximum Entropy Language Models with Conditional Entropy Regularization
Ariya Rastrow, Mark Dredze and Sanjeev Khudanpur
IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) – 2011

[abstract] [bib]

Abstract

Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. However, text training data for a domain of interest is often unavailable. Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text). In this work, we present a new semi-supervised language model adaptation procedure for Maximum Entropy models with n-gram features. We augment the conventional maximum likelihood training criterion on out-of-domain text data with an additional term to minimize conditional entropy on in-domain audio. Additionally, we demonstrate how to compute conditional entropy efficiently on speech lattices using first- and second-order expectation semirings. We demonstrate improvements in terms of word error rate over other adaptation techniques when adapting a maximum entropy language model from broadcast news to MIT lectures.
@inproceedings{Rastrow:2011fl,
author = {Rastrow, Ariya and Dredze, Mark and Khudanpur, Sanjeev},
title = {Adapting N-Gram Maximum Entropy Language Models with Conditional Entropy Regularization},
booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)}
}

