Publications

2013
Supervised Bilingual Lexicon Induction with Multiple Monolingual Signals
Ann Irvine and Chris Callison-Burch
to appear in the Proceedings of the North American Association for Computational Linguistics – 2013
Abstract
Prior research into learning translations from monolingual texts has treated the task as an unsupervised learning problem. Although many techniques take advantage of a seed bilingual lexicon, this work is the first to use that data for supervised learning to combine a diverse set of monolingual signals into a single discriminative model. Even in a low resource machine translation setting, where induced translations have the potential to improve performance substantially, it is reasonable to assume access to some amount of data to perform this kind of optimization. We report bilingual lexicon induction accuracies that are on average nearly 50% higher than an unsupervised baseline. Large gains in accuracy hold for all 22 languages (low and high resource) that we investigate.
Statistical Machine Translation in Low Resource Settings
Ann Irvine
to appear in the Proceedings of the NAACL Student Research Workshop – 2013
SenseSpotting: Never let your parallel data tie you to an old domain
Marine Carpuat, Hal Daumé III, Katharine Henry, Ann Irvine, Jagadeesh Jagarlamudi and Rachel Rudinger
to appear in the Proceedings of the Association for Computational Linguistics – 2013
Abstract
Words often gain new senses in new domains. Being able to automatically identify, from a corpus of monolingual text, which word tokens are being used in a previously unseen sense has applications to machine translation and other tasks sensitive to lexical semantics. We define a task, SENSESPOTTING, in which we build systems to spot tokens that have new senses in new domain text. Instead of difficult and expensive annotation, we build a gold-standard by leveraging cheaply available parallel corpora, targeting our approach to the problem of domain adaptation for machine translation. Our system is able to achieve F-measures of as much as 80%, when applied to word types it has never seen before. Our approach is based on a large set of novel features that capture varied aspects of how words change when used in new domains.
The (Un)faithful Machine Translator
Ruth Jones and Ann Irvine
to appear in the ACL Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities – 2013
2012
A Flexible Solver for Finite Arithmetic Circuits
Nathaniel Filardo and Jason Eisner
Technical Communications of the 28th International Conference on Logic Programming, ICLP 2012 – 2012
Abstract
Arithmetic circuits arise in the context of weighted logic programming languages, such as Datalog with aggregation, or Dyna. A weighted logic program defines a generalized arithmetic circuit—the weighted version of a proof forest, with nodes having arbitrary rather than boolean values. In this paper, we focus on finite circuits. We present a flexible algorithm for efficiently querying node values as they change under updates to the circuit's inputs. Unlike traditional algorithms, ours is agnostic about which nodes are tabled (materialized), and can vary smoothly between the traditional strategies of forward and backward chaining. Our algorithm is designed to admit future generalizations, including cyclic and infinite circuits and propagation of delta updates.
MAP Estimation of Whole-Word Acoustic Models with Dictionary Priors
Keith Kintzley, Aren Jansen and Hynek Hermansky
Proc. of INTERSPEECH – 2012
Abstract
The intrinsic advantages of whole-word acoustic modeling are offset by the problem of data sparsity. To address this, we present several parametric approaches to estimating intra-word phonetic timing models under the assumption that relative timing is independent of word duration. We show evidence that the timing of phonetic events is well described by the Gaussian distribution. We explore the construction of models in the absence of keyword examples (dictionary-based), when keyword examples are abundant (Gaussian mixture models), and also present a Bayesian approach which unifies the two. Applying these techniques in a point process model keyword spotting framework, we demonstrate a 55% relative improvement in performance for models constructed from few examples.
Inverting the Point Process Model for Fast Phonetic Keyword Search
Keith Kintzley, Aren Jansen, Kenneth Church and Hynek Hermansky
Proc. of INTERSPEECH – 2012
Abstract
Normally, we represent speech as a long sequence of frames and model the keyword with a relatively small set of parameters, commonly with a hidden Markov model (HMM). However, since the input speech is much longer than the keyword, suppose instead that we represent the speech as a relatively sparse set of impulses (roughly one per phoneme) and model the keyword as a filter-bank where each filter's impulse response relates to the likelihood of a phone at a given position within a word. Evaluating keyword detections can then be seen as a convolution of an impulse train with an array of filters. This view enables huge speedups; runtime no longer depends on the frame rate and is instead linear in the number of events (impulses). We apply this intuition to redesign the runtime engine behind the point process model for keyword spotting. We demonstrate impressive real-time speedups (500,000x faster than real-time) with minimal loss in search accuracy.
Name Phylogeny: A Generative Model of String Variation
Nicholas Andrews, Jason Eisner and Mark Dredze
Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) – 2012
Abstract
Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
Findings of the 2012 Workshop on Statistical Machine Translation
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut and Lucia Specia
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012
Abstract
This paper presents the results of the WMT12 shared tasks, which included a translation task, a task for machine translation evaluation metrics, and a task for run-time estimation of machine translation quality. We conducted a large-scale manual evaluation of 103 machine translation systems submitted by 34 teams. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 12 evaluation metrics. We introduced a new quality estimation task this year, and evaluated submissions from 11 teams.
Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
Matt Post, Chris Callison-Burch and Miles Osborne
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012
Abstract
Recent work has established the efficacy of Amazon's Mechanical Turk for constructing parallel corpora for machine translation research. We apply this to building a collection of parallel corpora between English and six languages from the Indian subcontinent: Bengali, Hindi, Malayalam, Tamil, Telugu, and Urdu. These languages are low-resource, under-studied, and exhibit linguistic phenomena that are difficult for machine translation. We conduct a variety of baseline experiments and analysis, and release the data to the community.
Using Categorial Grammar to Label Translation Rules
Jonathan Weese, Chris Callison-Burch and Adam Lopez
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012
Abstract
Adding syntactic labels to synchronous context-free translation rules can improve performance, but labeling with phrase structure constituents, as in GHKM (Galley et al., 2004), excludes potentially useful translation rules. SAMT (Zollmann and Venugopal, 2006) introduces heuristics to create new non-constituent labels, but these heuristics introduce many complex labels and tend to add rarely-applicable rules to the translation grammar. We introduce a labeling scheme based on categorial grammar, which allows syntactic labeling of many rules with a minimal, well-motivated label set. We show that our labeling scheme performs comparably to SAMT on an Urdu–English translation task, yet the label set is an order of magnitude smaller, and translation is twice as fast.
Joshua 4.0: Packing, PRO, and Paraphrases
Juri Ganitkevitch, Yuan Cao, Jonathan Weese, Matt Post and Chris Callison-Burch
Proceedings of the Seventh Workshop on Statistical Machine Translation – 2012
Abstract
We present Joshua 4.0, the newest version of our open-source decoder for parsing-based statistical machine translation. The main contributions in this release are the introduction of a compact grammar representation based on packed tries, and the integration of our implementation of pairwise ranking optimization, J-PRO. We further present the extension of the Thrax SCFG grammar extractor to pivot-based extraction of syntactically informed sentential paraphrases.
Monolingual Distributional Similarity for Text-to-Text Generation
Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch
*SEM First Joint Conference on Lexical and Computational Semantics – 2012
Abstract
Previous work on paraphrase extraction and application has relied on either parallel datasets, or on distributional similarity metrics over large text corpora. Our approach combines these two orthogonal sources of information and directly integrates them into our paraphrasing system’s log-linear model. We compare different distributional similarity feature-sets and show significant improvements in grammaticality and meaning retention on the example text-to-text generation task of sentence compression, achieving state-of-the-art quality.
Machine Translation of Arabic Dialects
Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar Zaidan and Chris Callison-Burch
The 2012 Conference of the North American Chapter of the Association for Computational Linguistics – 2012
Abstract
Arabic dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build Levantine-English and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialect sentences are selected from a large corpus of Arabic web text, and translated using Mechanical Turk. We use this data to build Dialect Arabic MT systems. Small amounts of dialect data have a dramatic impact on the quality of translation. When translating Egyptian and Levantine test sets, our Dialect Arabic MT system performs 5.8 and 6.8 BLEU points higher than a Modern Standard Arabic MT system trained on a 150 million word Arabic-English parallel corpus -- over 100 times the amount of data as our dialect corpora.
Training and Evaluating a Statistical Part of Speech Tagger for Natural Language Applications using Kepler Workflows
Doug Briesch, Reginald Hobbs, Claire Jaja, Brian Kjersten and Clare Voss
Procedia Computer Science – 2012
Abstract
A core technology of natural language processing (NLP) incorporated into many text processing applications is a part of speech (POS) tagger, a software component that labels words in text with syntactic tags such as noun, verb, adjective, etc. These tags may then be used within more complex tasks such as parsing, question answering, and machine translation (MT). In this paper we describe the phases of our work training and evaluating statistical POS taggers on Arabic texts and their English translations using Kepler workflows. While the original objectives for encapsulating our research code within Kepler workflows were driven by software engineering needs to document and verify the reusability of our software, our research benefitted as well: the ease of rapid retraining and testing enabled our researchers to detect reporting discrepancies, document their source, independently validating the correct results.
Annotated Gigaword
Courtney Napoles, Matt Gormley and Benjamin Van Durme
AKBC-WEKEX Workshop at NAACL 2012 – 2012
Cost-Sensitive Dynamic Feature Selection
He He, Hal Daume III and Jason Eisner
ICML Workshop on Inferning: Interactions between Inference and Learning – 2012
Abstract
We present an instance-specific test-time dynamic feature selection algorithm. Our algorithm sequentially chooses features given previously selected features and their values. It stops the selection process to make a prediction according to a user-specified accuracy-cost trade-off. We cast the sequential decision-making problem as a Markov Decision Process and apply imitation learning techniques. We address the problem of learning and inference jointly in a simple multiclass classification setting. Experimental results on UCI datasets show that our approach achieves the same or higher accuracy using only a small fraction of features than static feature selection methods.
Fast and Accurate Prediction via Evidence-Specific MRF Structure
Veselin Stoyanov and Jason Eisner
ICML Workshop on Inferning: Interactions between Inference and Learning – 2012
Abstract
We are interested in speeding up approximate inference in Markov Random Fields (MRFs). We present a new method that uses gates—binary random variables that determine which factors of the MRF to use. Which gates are open depends on the observed evidence; when many gates are closed, the MRF takes on a sparser and faster structure that omits "unnecessary" factors. We train parameters that control the gates, jointly with the ordinary MRF parameters, in order to locally minimize an objective that combines loss and runtime.
Implicitly Intersecting Weighted Automata using Dual Decomposition
Michael Paul and Jason Eisner
Proceedings of NAACL-HLT – 2012
Abstract
We propose an algorithm to find the best path through an intersection of arbitrarily many weighted automata, without actually performing the intersection. The algorithm is based on dual decomposition: the automata attempt to agree on a string by communicating about features of the string. We demonstrate the algorithm on the Steiner consensus string problem, both on synthetic data and on consensus decoding for speech recognition. This involves implicitly intersecting up to 100 automata.
Unsupervised Learning on an Approximate Corpus
Jason Smith and Jason Eisner
Proceedings of NAACL-HLT – 2012
Abstract
Unsupervised learning techniques can take advantage of large amounts of unannotated text, but the largest text corpus (the Web) is not easy to use in its full form. Instead, we have statistics about this corpus in the form of n-gram counts (Brants and Franz, 2006). While n-gram counts do not directly provide sentences, a distribution over sentences can be estimated from them in the same way that n-gram language models are estimated. We treat this distribution over sentences as an approximate corpus and show how unsupervised learning can be performed on such a corpus using variational inference. We compare hidden Markov model (HMM) training on exact and approximate corpora of various sizes, measuring speed and accuracy on unsupervised part-of-speech tagging.
Minimum-Risk Training of Approximate CRF-Based NLP Systems
Veselin Stoyanov and Jason Eisner
Proceedings of NAACL-HLT – 2012
Abstract
Conditional Random Fields (CRFs) are a popular formalism for structured prediction in NLP. It is well known how to train CRFs with certain topologies that admit exact inference, such as linear-chain CRFs. Some NLP phenomena, however, suggest CRFs with more complex topologies. Should such models be used, considering that they make exact inference intractable? Stoyanov et al. (2011) recently argued for training parameters to minimize the task-specific loss of whatever approximate inference and decoding methods will be used at test time. We apply their method to three NLP problems, showing that (i) using more complex CRFs leads to improved performance, and that (ii) minimum-risk training learns more accurate models.
Learned Prioritization for Trading Off Accuracy and Speed
Jiarong Jiang, Adam Teichert, Hal Daume III and Jason Eisner
ICML Workshop on Inferning: Interactions between Inference and Learning – 2012
Abstract
Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs for particular problems or datasets. We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is too large to explore naively. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.
Shared Components Topic Models
Matt Gormley, Mark Dredze, Benjamin Van Durme and Jason Eisner
Proceedings of NAACL-HLT – 2012
Abstract
With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.
Space Efficiencies in Discourse Modeling via Conditional Random Sampling
Brian Kjersten and Benjamin Van Durme
2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies – 2012
Abstract
Recent exploratory efforts in discourse-level language modeling have relied heavily on calculating Pointwise Mutual Information (PMI), which involves significant computation when done over large collections. Prior work has required aggressive pruning or independence assumptions to compute scores on large collections. We show the method of Conditional Random Sampling, thus far an underutilized technique, to be a space-efficient means of representing the sufficient statistics in discourse that underlie recent PMI-based work. This is demonstrated in the context of inducing Shankian script-like structures over news articles.
Stylometric Analysis of Scientific Articles
Shane Bergsma, Matt Post and David Yarowsky
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies – 2012
Tags: stylometry, syntax
Abstract
We present an approach to automatically recover hidden attributes of scientific articles, such as whether the author is a native English speaker, whether the author is a male or a female, and whether the paper was published in a conference or workshop proceedings. We train classifiers to predict these attributes in computational linguistics papers. The classifiers perform well in this challenging domain, identifying non-native writing with 95% accuracy (over a baseline of 67%). We show the benefits of using syntactic features in stylometry; syntax leads to significant improvements over bag-of-words models on all three tasks, achieving 10% to 25% relative error reduction. We give a detailed analysis of which words and syntax most predict a particular attribute, and we show a strong correlation between our predictions and a paper’s number of citations.
Judging Grammaticality with Count-Induced Tree Substitution Grammars
Francis Ferraro, Matt Post and Benjamin Van Durme
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP – 2012
Abstract
Prior work has shown the utility of syntactic tree fragments as features in judging the grammaticality of text. To date such fragments have been extracted from derivations of Bayesian-induced Tree Substitution Grammars (TSGs). Evaluating on discriminative coarse and fine grammaticality classification tasks, we show that a simple, deterministic, count-based approach to fragment identification performs on par with the more complicated grammars of Post (2011). This represents a significant reduction in complexity for those interested in the use of such fragments in the development of systems for the educational domain.
Toward Tree Substitution Grammars with Latent Annotations
Francis Ferraro, Benjamin Van Durme and Matt Post
Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure – 2012
Abstract
We provide a model that extends the split-merge framework of Petrov et al. (2006) to jointly learn latent annotations and Tree Substitution Grammars (TSGs). We then conduct a variety of experiments with this model, first inducing grammars on a portion of the Penn Treebank and the Korean Treebank 2.0, and next experimenting with grammar refinement from a single nonterminal and from the Universal Part of Speech tagset. We present qualitative analysis showing promising signs across all experiments that our combined approach successfully provides for greater flexibility in grammar induction within the structured guidance provided by the treebank, leveraging the complementary natures of these two approaches.
Toward Statistical Machine Translation without Parallel Corpora
Alex Klementiev, Ann Irvine, Chris Callison-Burch and David Yarowsky
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics – 2012
Abstract
We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate re-ordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed, and show that 82%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.
Learning Multivariate Distributions by Competitive Assembly of Marginals
Francisco Sanchez-Vega, Jason Eisner, Laurent Younes and Donald Geman
IEEE Transactions on Pattern Analysis and Machine Intelligence – 2012
Abstract
We present a new framework for learning high-dimensional multivariate probability distributions from estimated marginals. The approach is motivated by compositional models and Bayesian networks, and designed to adapt to small sample sizes. We start with a large, overlapping set of elementary statistical building blocks, or "primitives," which are low-dimensional marginal distributions learned from data. Each variable may appear in many primitives. Subsets of primitives are combined in a lego-like fashion to construct a probabilistic graphical model; only a small fraction of the primitives will participate in any valid construction. Since primitives can be precomputed, parameter estimation and structure search are separated. Model complexity is controlled by strong biases; we adapt the primitives to the amount of training data and impose rules which restrict the merging of them into allowable compositions. The likelihood of the data decomposes into a sum of local gains, one for each primitive in the final structure. We focus on a specific subclass of networks which are binary forests. Structure optimization corresponds to an integer linear program and the maximizing composition can be computed for reasonably large numbers of variables. Performance is evaluated using both synthetic data and real datasets from natural language processing and computational biology.
Confidence-Weighted Linear Classification for Text Categorization
Koby Crammer, Mark Dredze and Fernando Pereira
2012
Abstract
Confidence-weighted online learning is a generalization of margin-based learning of linear classifiers in which the margin constraint is replaced by a probabilistic constraint based on a distribution over classifier weights that is updated online as examples are observed. The distribution captures a notion of confidence on classifier weights, and in some cases it can also be interpreted as replacing a single learning rate by adaptive per-weight rates. Confidence-weighted learning was motivated by the statistical properties of natural language classification tasks, where most of the informative features are relatively rare. We investigate several versions of confidence-weighted learning that use a Gaussian distribution over weight vectors, updated at each observed example to achieve high probability of correct classification for the example. Empirical evaluation on a range of text-categorization tasks shows that our algorithms improve over other state-of-the-art online and batch methods, learn faster in the online setting, and lead to better classifier combination for a type of distributed training commonly used in cloud computing.
Entity Clustering Across Languages
Spence Green, Nicholas Andrews, Matt Gormley, Mark Dredze and Christopher Manning
NAACL – 2012
New H-Infinity Bounds for the Recursive Least Squares Algorithm Exploiting Input Structure
Koby Crammer, Alex Kulesza and Mark Dredze
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) – 2012
Use of Modality and Negation in Semantically-Informed Syntactic MT
Kathryn Baker, Bonnie Dorr, Michael Bloodgood, Chris Callison-Burch, Nathaniel Filardo, Christine Piatko, Lori Levin and Scott Miller
Computational Linguistics – 2012
Abstract
This article describes the resource- and system-building efforts of an eight-week JHU Human Language Technology Center of Excellence Summer Camp for Applied Language Exploration (SCALE-2009) on Semantically-Informed Machine Translation (SIMT). We describe a new modality/negation (MN) annotation scheme, a (publicly available) MN lexicon, and two automated MN taggers that we built using the annotation scheme and lexicon. Our annotation scheme isolates three components of modality and negation: a trigger (a word that conveys modality or negation), a target (an action associated with modality or negation) and a holder (an experiencer of modality). We describe how our MN lexicon was produced semi-automatically and we demonstrate that a structure-based MN tagger results in precision around 86% (depending on genre) for tagging of a standard LDC data set. We apply our MN annotation scheme to statistical machine translation using a syntactic framework that supports the inclusion of semantic annotations. Syntactic tags enriched with semantic annotations are assigned to parse trees in the target-language training texts through a process of tree grafting. While the focus of our work is modality and negation, the tree grafting procedure is general and supports other types of semantic information. We exploit this capability by including named entities, produced by a pre-existing tagger, in addition to the MN elements produced by the taggers described in this paper. The resulting system significantly outperformed a linguistically naïve baseline model (Hiero), and reached the highest scores yet reported on the NIST 2009 Urdu-English test set. This finding supports the hypothesis that both syntactic and semantic information can improve translation quality.
Processing Informal, Romanized Pakistani Text Messages
Ann Irvine, Jonathan Weese and Chris Callison-Burch
Proceedings of the NAACL Workshop on Language in Social Media – 2012
Abstract
Regardless of language, the standard character set for text messages (SMS) and many other social media platforms is the Roman alphabet. There are romanization conventions for some character sets, but they are used inconsistently in informal text, such as SMS. In this work, we convert informal, romanized Urdu messages into the native Arabic script and normalize non-standard SMS language. Doing so prepares the messages for existing downstream processing tools, such as machine translation, which are typically trained on well-formed, native script text. Our model combines information at the word and character levels, allowing it to handle out-of-vocabulary items. Compared with a baseline deterministic approach, our system reduces both word and character error rate by over 50%.
Digitizing 18th-Century French Literature: Comparing transcription methods for a critical edition text
Ann Irvine, Laure Marcellesi and Afra Zomorodian
Proceedings of the NAACL Workshop on Computational Linguistics for Literature – 2012
Abstract
We compare four methods for transcribing early printed texts. Our comparison is through a case-study of digitizing an eighteenth-century French novel for a new critical edition: the 1784 Lettres taïtiennes by Joséphine de Monbart. We provide a detailed error analysis of transcription by optical character recognition (OCR), non-expert humans, and expert humans and weigh each technique based on accuracy, speed, cost and the need for scholarly overhead. Our findings are relevant to 18th-century French scholars as well as the entire community of scholars working to preserve, present, and revitalize interest in literature published before the digital age.
Expectations of Word Sense in Parallel Corpora
Xuchen Yao, Benjamin Van Durme and Chris Callison-Burch
NAACL – 2012
Semantics-based Question Generation and Implementation
Xuchen Yao, Gosse Bouma and Zhaonian Zhang
Dialogue and Discourse, Special Issue on Question Generation – 2012
Sample Selection for Large-scale MT Discriminative Training
Yuan Cao and Sanjeev Khudanpur
Proceedings of the Annual Conference of the Association for Machine Translation in the Americas (AMTA) – 2012
2011
Arabic Optical Character Recognition (OCR) Evaluation in Order to Develop a Post-OCR Module
Brian Kjersten
2011
Abstract
Optical character recognition (OCR) is the process of converting an image of a document into text. While progress in OCR research has enabled low error rates for English text in low-noise images, performance is still poor for noisy images and documents in other languages. We intend to create a post-OCR processing module for noisy Arabic documents which can correct OCR errors before passing the resulting Arabic text to a translation system. To this end, we are evaluating an Arabic-script OCR engine on documents with the same content but varying levels of image quality. We have found that OCR text accuracy can be improved with different stages of pre-OCR image processing: (1) filtering out low-contrast images to avoid hallucination of characters, (2) removing marks from images with cleanup software to prevent their misrecognition, and (3) zoning multi-column images with segmentation software to enable recognition of all zones. The specific errors observed in OCR will form the basis of training data for our post-OCR correction module.
Using Visual Information to Predict Lexical Preference
Shane Bergsma and Randy Goebel
Proc. RANLP – 2011
Event Selection from Phone Posteriorgrams Using Matched Filters
Keith Kintzley, Aren Jansen and Hynek Hermansky
Proc. of INTERSPEECH – 2011
Abstract
In this paper we address the issue of how to select a minimal set of phonetic events from a phone posteriorgram while minimizing the loss of information. We derive phone posteriorgrams from two sources, Gaussian mixture models and sparse multilayer perceptrons, and apply phone-specific matched filters to the posteriorgrams to yield a smaller set of phonetic events. We introduce a mutual information based performance measure to compare phonetic event selection techniques and demonstrate that events extracted using matched filters can reduce input data while significantly improving performance of an event-based keyword spotting system.
Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles and Benjamin Van Durme
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011
Abstract
Previous work has shown that high quality phrasal paraphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems.
author = {Doug Briesch and Reginald Hobbs and Claire Jaja and Kjersten, Brian and Clare Voss},
title = {Training and Evaluating a Statistical Part of Speech Tagger for Natural Language Applications using Kepler Workflows},
pages = {1588 - 1594},
url = {http://www.sciencedirect.com/science/article/pii/S1877050912002955},
abstract = {A core technology of natural language processing (NLP) incorporated into many text processing applications is a part of speech (POS) tagger, a software component that labels words in text with syntactic tags such as noun, verb, adjective, etc. These tags may then be used within more complex tasks such as parsing, question answering, and machine translation (MT). In this paper we describe the phases of our work training and evaluating statistical POS taggers on Arabic texts and their English translations using Kepler workflows. While the original objectives for encapsulating our research code within Kepler workflows were driven by software engineering needs to document and verify the reusability of our software, our research benefitted as well: the ease of rapid retraining and testing enabled our researchers to detect reporting discrepancies, document their source, and independently validate the correct results.}
}
Joshua 3.0: Syntax-based Machine Translation with the Thrax Grammar Extractor
Jonathan Weese, Juri Ganitkevitch, Chris Callison-Burch, Matt Post and Adam Lopez
Proceedings of the Sixth Workshop on Statistical Machine Translation – 2011
Abstract
We present progress on Joshua, an open source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open source synchronous context-free grammar extractor. Thrax extracts both hierarchical (Chiang, 2007) and syntax-augmented machine translation (Zollmann and Venugopal, 2006) grammars. It is built on Apache Hadoop for efficient distributed performance, and can easily be extended with support for new grammars, feature functions, and output formats.
author = {He He and Hal Daume III and Eisner, Jason},
title = {Cost-Sensitive Dynamic Feature Selection},
booktitle = {ICML Workshop on Inferning: Interactions between Inference and Learning},
url = {http://cs.jhu.edu/~jason/papers/#icmlw12-dynfeat},
abstract = {We present an instance-specific test-time dynamic feature selection algorithm. Our algorithm sequentially chooses features given previously selected features and their values. It stops the selection process to make a prediction according to a user-specified accuracy-cost trade-off. We cast the sequential decision-making problem as a Markov Decision Process and apply imitation learning techniques. We address the problem of learning and inference jointly in a simple multiclass classification setting. Experimental results on UCI datasets show that our approach achieves the same or higher accuracy using only a small fraction of features than static feature selection methods.}
}
Findings of the 2011 Workshop on Statistical Machine Translation
Chris Callison-Burch, Philipp Koehn, Christof Monz and Omar Zaidan
Proceedings of the Sixth Workshop on Statistical Machine Translation – 2011
Abstract
This paper presents the results of the WMT11 shared tasks, which included a translation task, a system combination task, and a task for machine translation evaluation metrics. We conducted a large-scale manual evaluation of 148 machine translation systems and 41 system combination entries. We used the ranking of these systems to measure how strongly automatic metrics correlate with human judgments of translation quality for 21 evaluation metrics. This year featured a Haitian Creole to English task translating SMS messages sent to an emergency response service in the aftermath of the Haitian earthquake. We also conducted a pilot ‘tunable metrics’ task to test whether optimizing a fixed system to different metrics would result in perceptibly different translation quality.
You Are What You Tweet: Analyzing Twitter for Public Health
Michael Paul and Mark Dredze
5th International Conference on Weblogs and Social Media – 2011
Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images
Shane Bergsma and Benjamin Van Durme
Proc. IJCAI – 2011
Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation
Jason Riesa, Ann Irvine and Daniel Marcu
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011
Abstract
We present an accurate word alignment algorithm that heavily exploits source and target-language syntax. Using a discriminative framework and an efficient bottom-up search algorithm, we train a model of hundreds of thousands of syntactic features. Our new model (1) helps us to very accurately model syntactic transformations between languages; (2) is language-independent; and (3) with automatic feature extraction, assists system developers in obtaining good word-alignment performance off-the-shelf when tackling new language pairs. We analyze the impact of our features, describe inference under the model, and demonstrate significant alignment and translation quality improvements over already-powerful baselines trained on very large corpora. We observe translation quality improvements corresponding to 1.0 and 1.3 BLEU for Arabic-English and Chinese-English, respectively.
Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity
Charley Chan, Chris Callison-Burch and Benjamin Van Durme
Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics – 2011
Abstract
This paper improves an existing bilingual paraphrase extraction technique using monolingual distributional similarity to rerank candidate paraphrases. Raw monolingual data provides a complementary and orthogonal source of information that lessens the commonly observed errors in bilingual pivot-based methods. Our experiments reveal that monolingual scoring of bilingually extracted paraphrases has a significantly stronger correlation with human judgment for grammaticality than the probabilities assigned by the bilingual pivoting method does. The results also show that monolingual distribution similarity can serve as a threshold for high precision paraphrase selection.
Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
Markus Dreyer and Jason Eisner
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) – 2011
Abstract
We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50-100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.
WikiTopics: What is Popular on Wikipedia and Why
Byung Gyu Ahn, Benjamin Van Durme and Chris Callison-Burch
Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages – 2011
Abstract
We establish a novel task in the spirit of news summarization and topic detection and tracking (TDT): daily determination of the topics newly popular with Wikipedia readers. Central to this effort is a new public dataset consisting of the hourly page view statistics of all Wikipedia articles over the last three years. We give baseline results for the tasks of: discovering individual pages of interest, clustering these pages into coherent topics, and extracting the most relevant summarizing sentence for the reader. When compared to human judgements, our system shows the viability of this task, and opens the door to a range of exciting future work.
author = {Stoyanov, Veselin and Eisner, Jason},
title = {Fast and Accurate Prediction via Evidence-Specific MRF Structure},
booktitle = {ICML Workshop on Inferning: Interactions between Inference and Learning},
url = {http://cs.jhu.edu/~jason/papers/#icmlw12-gates},
abstract = {We are interested in speeding up approximate inference in Markov Random Fields (MRFs). We present a new method that uses gates—binary random variables that determine which factors of the MRF to use. Which gates are open depends on the observed evidence; when many gates are closed, the MRF takes on a sparser and faster structure that omits "unnecessary" factors. We train parameters that control the gates, jointly with the ordinary MRF parameters, in order to locally minimize an objective that combines loss and runtime.}
}
Evaluating Sentence Compression: Pitfalls and Suggested Remedies
Courtney Napoles, Benjamin Van Durme and Chris Callison-Burch
Proceedings of the Workshop on Monolingual Text-To-Text Generation – 2011
Abstract
This work surveys existing evaluation methodologies for the task of sentence compression, identifies their shortcomings, and proposes alternatives. In particular, we examine the problems of evaluating paraphrastic compression and comparing the output of different models. We demonstrate that compression rate is a strong predictor of compression quality and that perceived improvement over other models is often a side effect of producing longer output.
author = {Paul, Michael and Eisner, Jason},
title = {Implicitly Intersecting Weighted Automata using Dual Decomposition},
booktitle = {Proceedings of NAACL-HLT},
pages = {232--242},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-dd},
abstract = {We propose an algorithm to find the best path through an intersection of arbitrarily many weighted automata, without actually performing the intersection. The algorithm is based on dual decomposition: the automata attempt to agree on a string by communicating about features of the string. We demonstrate the algorithm on the Steiner consensus string problem, both on synthetic data and on consensus decoding for speech recognition. This involves implicitly intersecting up to 100 automata.}
}
Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion
Courtney Napoles, Chris Callison-Burch, Juri Ganitkevitch and Benjamin Van Durme
Proceedings of the Workshop on Monolingual Text-To-Text Generation – 2011
Tags: paraphrasing
Abstract
We present a substitution-only approach to sentence compression which “tightens” a sentence by reducing its character length. Replacing phrases with shorter paraphrases yields paraphrastic compressions as short as 60% of the original length. In support of this task, we introduce a novel technique for re-ranking paraphrases extracted from bilingual corpora. At high compression rates paraphrastic compressions outperform a state-of-the-art deletion model in an oracle experiment. For further compression, deleting from oracle paraphrastic compressions preserves more meaning than deletion alone. In either setting, paraphrastic compression shows promise for surpassing deletion-only methods.
author = {Smith, Jason and Eisner, Jason},
title = {Unsupervised Learning on an Approximate Corpus},
booktitle = {Proceedings of NAACL-HLT},
pages = {131--141},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-ngram},
abstract = {Unsupervised learning techniques can take advantage of large amounts of unannotated text, but the largest text corpus (the Web) is not easy to use in its full form. Instead, we have statistics about this corpus in the form of n-gram counts (Brants and Franz, 2006). While n-gram counts do not directly provide sentences, a distribution over sentences can be estimated from them in the same way that n-gram language models are estimated. We treat this distribution over sentences as an approximate corpus and show how unsupervised learning can be performed on such a corpus using variational inference. We compare hidden Markov model (HMM) training on exact and approximate corpora of various sizes, measuring speed and accuracy on unsupervised part-of-speech tagging.}
}
Paraphrase Fragment Extraction from Monolingual Comparable Corpora
Rui Wang and Chris Callison-Burch
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web – 2011
Abstract
We present a novel paraphrase fragment pair extraction method that uses a monolingual comparable corpus containing different articles about the same topics or events. The procedure consists of document pair extraction, sentence pair extraction, and fragment pair extraction. At each stage, we evaluate the intermediate results manually, and tune the later stages accordingly. With this minimally supervised approach, we achieve 62% accuracy on the paraphrase fragment pairs we collected and 67% on those extracted from the MSR corpus. The results look promising, given the minimal supervision of the approach, which can be further scaled up.
author = {Stoyanov, Veselin and Eisner, Jason},
title = {Minimum-Risk Training of Approximate CRF-Based NLP Systems},
booktitle = {Proceedings of NAACL-HLT},
pages = {120--130},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-risk},
abstract = {Conditional Random Fields (CRFs) are a popular formalism for structured prediction in NLP. It is well known how to train CRFs with certain topologies that admit exact inference, such as linear-chain CRFs. Some NLP phenomena, however, suggest CRFs with more complex topologies. Should such models be used, considering that they make exact inference intractable? Stoyanov et al. (2011) recently argued for training parameters to minimize the task-specific loss of whatever approximate inference and decoding methods will be used at test time. We apply their method to three NLP problems, showing that (i) using more complex CRFs leads to improved performance, and that (ii) minimum-risk training learns more accurate models.}
}
The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content
Omar Zaidan and Chris Callison-Burch
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011
Abstract
The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal content, and we describe our long-term annotation effort to identify the dialect level (and dialect itself) in each sentence of the dataset. So far, we have labeled 108K sentences, 41% of which have dialectal content. We also present experimental results on the task of automatic dialect identification, using the collected labels for training and evaluation.
author = {Jiarong Jiang and Teichert, Adam and Hal Daume III and Eisner, Jason},
title = {Learned Prioritization for Trading Off Accuracy and Speed},
booktitle = {ICML Workshop on Inferning: Interactions between Inference and Learning},
url = {http://cs.jhu.edu/~jason/papers/#icmlw12-ldp},
abstract = {Users want natural language processing (NLP) systems to be both fast and accurate, but quality often comes at the cost of speed. The field has been manually exploring various speed-accuracy tradeoffs for particular problems or datasets. We aim to explore this space automatically, focusing here on the case of agenda-based syntactic parsing (Kay, 1986). Unfortunately, off-the-shelf reinforcement learning techniques fail to learn good policies: the state space is too large to explore naively. We propose a hybrid reinforcement/apprenticeship learning algorithm that, even with few inexpensive features, can automatically learn weights that achieve competitive accuracies at significant improvements in speed over state-of-the-art baselines.}
}
Crowdsourcing Translation: Professional Quality from Non-Professionals
Omar Zaidan and Chris Callison-Burch
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011
Abstract
Naively collecting translations by crowdsourcing the task to non-professional translators yields disfluent, low-quality results if no quality control is exercised. We demonstrate a variety of mechanisms that increase the translation quality to near professional levels. Specifically, we solicit redundant translations and edits to them, and automatically select the best output among them. We propose a set of features that model both the translations and the translators, such as country of residence, LM perplexity of the translation, edit rate from the other translations, and (optionally) calibration against professional translators. Using these features to score the collected translations, we are able to discriminate between acceptable and unacceptable translations. We recreate the NIST 2009 Urdu-to-English evaluation set with Mechanical Turk, and quantitatively show that our models are able to select translations within the range of quality that we expect from professional translators. The total cost is more than an order of magnitude lower than professional translation.
author = {Gormley, Matt and Dredze, Mark and Van Durme, Benjamin and Eisner, Jason},
title = {Shared Components Topic Models},
booktitle = {Proceedings of NAACL-HLT},
pages = {783--792},
url = {http://cs.jhu.edu/~jason/papers/#naacl12-sctm},
abstract = {With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM can represent topics in a much more compact representation than LDA and achieves better perplexity with fewer parameters.}
}
Incremental Syntactic Language Models for Phrase-based Translation
Lane Schwartz, Chris Callison-Burch, William Schuler and Stephen Wu
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011
Abstract
This paper describes a novel technique for incorporating syntactic knowledge into phrase-based machine translation through incremental syntactic parsing. Bottom-up and top-down parsers typically require a completed string as input. This requirement makes it difficult to incorporate them into phrase-based translation, which generates partial hypothesized translations from left-to-right. Incremental syntactic language models score sentences in a similar left-to-right fashion, and are therefore a good mechanism for incorporating syntax into phrase-based translation. We give a formal definition of one such linear-time syntactic language model, detail its relation to phrase-based decoding, and integrate the model with the Moses phrase-based translation system. We present empirical results on a constrained Urdu-English translation task that demonstrate a significant BLEU score improvement and a large decrease in perplexity.
author = {Kjersten, Brian and Van Durme, Benjamin},
title = {Space Efficiencies in Discourse Modeling via Conditional Random Sampling},
booktitle = {2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
address = {Montreal, Canada},
publisher = {Association for Computational Linguistics},
pages = {513-517},
url = {http://www.aclweb.org/anthology/N/N12/N12-1056.pdf},
abstract = {Recent exploratory efforts in discourse-level language modeling have relied heavily on calculating Pointwise Mutual Information (PMI), which involves significant computation when done over large collections. Prior work has required aggressive pruning or independence assumptions to compute scores on large collections. We show the method of Conditional Random Sampling, thus far an underutilized technique, to be a space-efficient means of representing the sufficient statistics in discourse that underlie recent PMI-based work. This is demonstrated in the context of inducing Schankian script-like structures over news articles.}
}
Nonparametric Bayesian Word Sense Induction
Xuchen Yao and Benjamin Van Durme
Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing – 2011
Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
Shane Bergsma, David Yarowsky and Kenneth Church
Proc. ACL – 2011
Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
Colin Cherry and Shane Bergsma
Proc. ACL – 2011
Judging Grammaticality with Tree Substitution Grammar Derivations
Matt Post
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – 2011
Tags: text classification, grammaticality, tree substitution grammar
Abstract
In this paper, we show that local features computed from the derivations of tree substitution grammars – such as the identity of particular fragments, and a count of large and small fragments – are useful in binary grammatical classification tasks. Such features outperform n-gram features and various model scores by a wide margin. Although they fall short of the performance of the hand-crafted feature set of Charniak and Johnson (2005) developed for parse tree reranking, they do so with an order of magnitude fewer features. Furthermore, since the TSGs employed are learned in a Bayesian setting, the use of their derivations can be viewed as the automatic discovery of tree patterns useful for classification. On the BLLIP dataset, we achieve an accuracy of 89.9% in discriminating between grammatical text and samples from an n-gram language model.
Variational Approximation of Long-Span Language Models for LVCSR
Anoop Deoras, Tomáš Mikolov, Stefan Kombrink, Martin Karafiát and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011
Abstract
Long-span language models that capture syntax and semantics are seldom used in the first pass of large vocabulary continuous speech recognition systems due to the prohibitive search-space of sentence-hypotheses. Instead, an N-best list of hypotheses is created using tractable n-gram models, and rescored using the long-span models. It is shown in this paper that computationally tractable variational approximations of the long-span models are a better choice than standard n-gram models for first pass decoding. They not only result in a better first pass output, but also produce a lattice with a lower oracle word error rate, and rescoring the N-best list from such lattices with the long-span models requires a smaller N to attain the same accuracy. Empirical results on the WSJ, MIT Lectures, NIST 2007 Meeting Recognition and NIST 2001 Conversational Telephone Recognition data sets are presented to support these claims.
Extensions of Recurrent Neural Network Language Model
Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan Cernocky and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011
Abstract
We present several modifications of the original recurrent neural network language model (RNN LM). While this model has been shown to significantly outperform many competitive language modeling techniques in terms of accuracy, the remaining problem is the computational complexity. In this work, we show approaches that lead to more than 15 times speedup for both training and testing phases. Next, we show the importance of using a backpropagation through time algorithm. An empirical comparison with feedforward networks is also provided. In the end, we discuss ways to reduce the number of parameters in the model. The resulting RNN model can thus be smaller, faster both during training and testing, and more accurate than the basic one.
Hill Climbing on Speech Lattices: A New Rescoring Framework
Ariya Rastrow, Markus Dreyer, Abhinav Sethy, Sanjeev Khudanpur, Bhuvana Ramabhadran and Mark Dredze
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011
Abstract
We describe a new approach for rescoring speech lattices - with long-span language models or wide-context acoustic models - that does not entail computationally intensive lattice expansion or limited rescoring of only an N-best list. We view the set of word-sequences in a lattice as a discrete space equipped with the edit-distance metric, and develop a hill climbing technique to start with, say, the 1-best hypothesis under the lattice-generating model(s) and iteratively search a local neighborhood for the highest-scoring hypothesis under the rescoring model(s); such neighborhoods are efficiently constructed via finite state techniques. We demonstrate empirically that to achieve the same reduction in error rate using a better estimated, higher order language model, our technique evaluates fewer utterance-length hypotheses than conventional N-best rescoring by two orders of magnitude. For the same number of hypotheses evaluated, our technique results in a significantly lower error rate.
Learning and Inference Algorithms for Partially-Observed Structured Switching Vector Autoregressive Models
Balakrishnan Varadarajan and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011
Abstract
We present learning and inference algorithms for a versatile class of partially observed vector autoregressive (VAR) models for multivariate time-series data. VAR models can capture a wide variety of temporal dynamics in a continuous multidimensional signal. Given a sequence of observations to be modeled by a VAR model, it is possible to estimate its parameters in closed form by solving a least squares problem. For high dimensional observations, the state space representation of a linear system is often invoked. One advantage of doing so is that we model the dynamics of a low dimensional hidden state instead of the observations, which results in robust estimation of the dynamical system parameters. The commonly used approach is to project the high dimensional observation to the low dimensional state space using a KL transform. In this article, we propose a novel approach to automatically discover the low dimensional dynamics in a switching VAR model by imposing discriminative structure on the model parameters. We demonstrate its efficacy via significant improvements in gesture recognition accuracy over a standard hidden Markov model, which does not take the state-conditional dynamics of the observations into account, on a bench-top suturing task.
Dirichlet Mixtures to Model Neural Network Posteriors in the HMM Framework
Balakrishnan Varadarajan, Sri Garimella and Sanjeev Khudanpur
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing – 2011
Abstract
In this paper, we present a novel technique for modeling the posterior probability estimates obtained from a neural network directly in the HMM framework using the Dirichlet Mixture Models (DMMs). Since posterior probability vectors lie on a probability simplex their distribution can be modeled using DMMs. Being in an exponential family, the parameters of DMMs can be estimated in an efficient manner. Conventional approaches like TANDEM attempt to gaussianize the posteriors by suitable transforms and model them using Gaussian Mixture Models (GMMs). This requires a larger number of parameters as it does not exploit the fact that the probability vectors lie on a simplex. We demonstrate through TIMIT phoneme recognition experiments that the proposed technique outperforms the conventional TANDEM approach.
Generating More Specific Questions
Xuchen Yao
AAAI Symposium on Question Generation – 2011
NADA: A Robust System for Non-Referential Pronoun Detection
Shane Bergsma and David Yarowsky
Proc. DAARC – 2011
Johns Hopkins on the chip: microsystems and cognitive machines for sustainable, affordable, personalized medicine and health care (invited paper)
Andreas G Andreou
2011
Abstract
Semiconductor technology is contributing to the advancement of biotechnology, medicine and healthcare delivery in ways that were never envisioned - from chip micro-arrays, to scientific grade CMOS imagers and ion sensing arrays to implantable prostheses. This exponential growth of sensory microsystems has led to an exponential growth of data. Cognitive machines, i.e. advanced computer architectures and algorithms, are carefully co-designed to extract knowledge from such health data, making rational decisions and recommendations for therapies. Nano, micro and macro robotics driven by sophisticated algorithms interface to the human body at different levels and scales, from nano-scale molecules to micron-scale cells to networks and all the way to the scale of organisms. The present era is one where semiconductor technology and the 'chip' is the foundation of sustainable and affordable personalised medicine and healthcare delivery.
Language Models for Semantic Extraction and Filtering in Video Action Recognition
Evelyne Tzoukermann, Jan Neumann, Jana Kosecka, Cornelia Fermuller, Ian Perera, Francis Ferraro, Benjamin Sapp, Rizwan Chaudry and Gautam Singh
AAAI Workshop on Language-Action Tools for Cognitive Artificial Agents – 2011
Recognizing Manipulation Actions in Arts and Crafts Shows using Domain Specific Visual and Textual Cues
Benjamin Sapp, Rizwan Chaudry, Xiaodong Yu, Gautam Singh, Ian Perera, Francis Ferraro, Evelyne Tzoukermann, Jana Kosecka and Jan Neumann
The 3rd International Workshop on Video Event Categorization, Tagging and Retrieval for Real-World Applications (VECTaR2011) – 2011
Beyond Amdahl's law: An objective function that links multiprocessor performance gains to delay and energy
Andrew S Cassidy and Andreas G Andreou
2011
Minimum Imputed Risk Unsupervised Discriminative Training for Machine Translation
Zhifei Li, Ziyuan Wang, Jason Eisner, Sanjeev Khudanpur and Brian Roark
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011
Abstract
Discriminative training for machine translation has been well studied in the recent past. A limitation of the work to date is that it relies on the availability of high-quality in-domain bilingual text for supervised training. We present an unsupervised discriminative training framework to incorporate the usually plentiful target-language monolingual data by using a rough “reverse” translation system. Intuitively, our method strives to ensure that probabilistic “round-trip” translation from a target-language sentence to the source-language and back will have low expected loss. Theoretically, this may be justified as (discriminatively) minimizing an imputed empirical risk. Empirically, we demonstrate that augmenting supervised training with unsupervised data improves translation performance over the supervised case for both IWSLT and NIST tasks.
Unsupervised Arabic Dialect Adaptation with Self Training
Scott Novotney, Rich Schwartz and Sanjeev Khudanpur
Proceedings of the 12th Annual Conference of the International Speech Communication Association – 2011
Abstract
Useful training data for automatic speech recognition systems of colloquial speech is usually limited to expensive in-domain transcription. Broadcast news is an appealing source of easily available data to bootstrap into a new dialect. However, some languages, like Arabic, have deep linguistic differences resulting in poor cross-domain performance. If no in-domain transcripts are available, but a large amount of in-domain audio is, self-training may be a suitable technique to bootstrap into the domain. In this work, we attempt to adapt Modern Standard Arabic (MSA) models to Levantine Arabic without any in-domain manual transcription. We contrast with varying amounts of in-domain transcription and show that 1) self-training is effective with only one hour of in-domain transcripts; 2) self-training is not a suitable solution to improve strong MSA models on Levantine; 3) two metrics that quantify model bias predict self-training success; and 4) model bias explains the failure of self-training to adapt across strong domain mismatch.
Efficient Subsampling for Training Complex Language Models
Puyang Xu, Asela Gunawardana and Sanjeev Khudanpur
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011
Abstract
We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% to 5% of the standard complexity with no loss in performance.
Learning Speed-Accuracy Tradeoffs in Nondeterministic Inference Algorithms
Jason Eisner and Hal Daumé III
COST: NIPS 2011 Workshop on Computational Trade-offs in Statistical Learning – 2011
Abstract
Could we explicitly train test-time inference heuristics to trade off accuracy and efficiency? We focus our discussion on agenda-based natural language parsing under a weighted context-free grammar. We frame the problem as reinforcement learning, discuss its special properties, and propose new strategies.
Human action categorization using ultrasound micro-Doppler signatures
Salvador Dura-Bernal, Guillaume Garreau, Charalambos Andreou, Andreas G Andreou, Julius Georgiou, Thomas Wennekers and Susan Denham
2011
A high-level analytical model for application specific CMP design exploration
Andrew S Cassidy, Kai Yu, Haolang Zhou and Andreas G Andreou
2011
Bio-Inspired Cognitive Analysis for Active and Passive Acoustic Sensors
Andreas G Andreou
2011
Design of a one million neuron single FPGA neuromorphic system for real-time multimodal scene analysis
Andrew S Cassidy and Andreas G Andreou
45th Annual Conference on Information Sciences and Systems (CISS 2011) – 2011
A multimodal-corpus data collection system for cognitive acoustic scene analysis
Julius Georgiou, Philippe O Pouliquen, Andrew S Cassidy, Guillaume Garreau, Charalambos Andreou, Guillermo Stuarts, Cyrlle d'Urbal, Susan Denham, Thomas Wennekers, Robert Mill, Istvan Winkler, Tamas Bohm, Orsolya Szalardy, Georg Klump, Simon Jones, Alexandra Bendixen and Andreas G Andreou
2011
Confusion Network Decoding for MT System Combination
Antti-Veikko Rosti, Eugene Matusov, Jason Smith, Necip Ayan, Jason Eisner, Damianos Karakos, Sanjeev Khudanpur, Gregor Leusch, Zhifei Li, Spyros Matsoukas, Hermann Ney, Richard Schwartz, B. Zhang and J. Zheng
Handbook of Natural Language Processing and Machine Translation – 2011
Forest Reranking for Machine Translation Using the Direct Translation Model
Zhi Li and Sanjeev Khudanpur
Handbook of Natural Language Processing and Machine Translation – 2011
Stepwise Optimal Subspace Pursuit for Improving Sparse Recovery
Balakrishnan Varadarajan, Sanjeev Khudanpur and Trac Tran
IEEE Signal Processing Letters – 2011
Abstract
We propose a new iterative algorithm to reconstruct an unknown sparse signal x from a set of projected measurements y = Φx. Unlike existing methods, which rely crucially on the near orthogonality of the sampling matrix Φ, our approach makes stepwise optimal updates even when the columns of Φ are not orthogonal. We invoke a block-wise matrix inversion formula to obtain a closed-form expression for the increase (reduction) in the L2-norm of the residue obtained by removing (adding) a single element from (to) the presumed support of x. We then use this expression to design a computationally tractable algorithm to search for the nonzero components of x. We show that compared to currently popular sparsity seeking matching pursuit algorithms, each step of the proposed algorithm is locally optimal with respect to the actual objective function. We demonstrate experimentally that the algorithm significantly outperforms conventional techniques in recovering sparse signals whose nonzero values have exponentially decaying magnitudes or are distributed N(0,1).
Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure
Veselin Stoyanov, Alexander Ropson and Jason Eisner
Proceedings of AISTATS – 2011
Abstract
Graphical models are often used "inappropriately," with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic metadescent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.
Estimating Document Frequencies in a Speech Corpus
Damianos Karakos, Mark Dredze, Kenneth Church, Aren Jansen and Sanjeev Khudanpur
IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) – 2011
Abstract
Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
Adapting N-Gram Maximum Entropy Language Models with Conditional Entropy Regularization
Ariya Rastrow, Mark Dredze and Sanjeev Khudanpur
IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) – 2011
Abstract
Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. However, text training data for a domain of interest is often unavailable. Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text). In this work, we present a new semi-supervised language model adaptation procedure for Maximum Entropy models with n-gram features. We augment the conventional maximum likelihood training criterion on out-of-domain text data with an additional term to minimize conditional entropy on in-domain audio. Additionally, we demonstrate how to compute conditional entropy efficiently on speech lattices using first- and second-order expectation semirings. We demonstrate improvements in terms of word error rate over other adaptation techniques when adapting a maximum entropy language model from broadcast news to MIT lectures.
Efficient Discriminative Training of Long-Span Language Models
Ariya Rastrow, Mark Dredze and Sanjeev Khudanpur
IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) – 2011
Abstract
Long-span language models, such as those involving syntactic dependencies, produce more coherent text than their n-gram counterparts. However, evaluating the large number of sentence-hypotheses in a packed representation such as an ASR lattice is intractable under such long-span models both during decoding and discriminative training. The accepted compromise is to rescore only the N-best hypotheses in the lattice using the long-span LM. We present discriminative hill climbing, an efficient and effective discriminative training procedure for long-span LMs based on a hill climbing rescoring algorithm. We empirically demonstrate significant computational savings as well as error-rate reduction over N-best training methods in a state-of-the-art ASR system for Broadcast News transcription.
Entity Linking: Finding Extracted Entities in a Knowledge Base
Delip Rao, Paul McNamee and Mark Dredze
Multi-source, Multi-lingual Information Extraction and Summarization – 2011
Abstract
In the menagerie of tasks for information extraction, entity linking is a new beast that has drawn a lot of attention from NLP practitioners and researchers recently. Entity Linking, also referred to as record linkage or entity resolution, involves aligning a textual mention of a named-entity to an appropriate entry in a knowledge base, which may or may not contain the entity. This has manifold applications ranging from linking patient health records to maintaining personal credit files, prevention of identity crimes, and supporting law enforcement. We discuss the key challenges present in this task and we present a high-performing system that links entities using max-margin ranking. We also summarize recent work in this area and describe several open research problems.
Optimality Theory Syntax Learnability: An Empirical Exploration of the Perceptron and GLA
Ann Irvine, Mark Dredze, Geraldine Legendre and Paul Smolensky
CogSci Workshop on OT as a General Cognitive Architecture – 2011
Abstract
This work brings together several threads of research on Optimality Theory (OT) and Harmonic Grammar (HG) learnability. As noted in previous work, including Pater (2008) and Magri (2010), the perceptron learning algorithm is well-established in the Machine Learning field and is a natural choice for modeling human grammar acquisition. The algorithm learns from one observation at a time, and it is capable of learning from a noisy corpus of observed natural language. In this work, we use the perceptron algorithm to learn a model that specifies a set of constraint weights relevant to one syntax phenomenon, Czech word order. We extract training data (sentences annotated with grammatical and information structure and their surface word orders) from the Prague Dependency Treebank (Hajic et al., 2001) and use basic alignment (edge-most) constraints on grammatical and information structure to predict the surface order of the subject, verb, and object. The perceptron algorithm learns a set of numeric, weighted constraints (a Harmonic Grammar). Ordering the constraints by the magnitude of their weights may specify a hierarchical constraint ranking (an OT Grammar), which is the essence of the classic Gradual Learning Algorithm (GLA) (Boersma, 1997). We describe and compare the two learning algorithms in detail and use a held-out set of empirical data to quantitatively evaluate each. We show that by allowing for so-called ganging-up effects, the more expressive Harmonic Grammar models Czech word order more accurately than the GLA OT grammar. Finally, crucially, it is also capable of modeling variation in production.
OOV Sensitive Named-Entity Recognition in Speech
Carolina Parada, Mark Dredze and Frederick Jelinek
International Speech Communication Association (INTERSPEECH) – 2011
Abstract
Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named entities and always produce transcription errors. In this work, we improve speech NER by including features indicative of OOVs based on an OOV detector, allowing for the identification of regions of speech containing named entities, even if they are incorrectly transcribed. We construct a new speech NER data set and demonstrate significant improvements for this task.
A Model for Mining Public Health Topics from Twitter
Michael Paul and Mark Dredze
2011
Abstract
We present the Ailment Topic Aspect Model (ATAM), a new topic model for Twitter that associates symptoms, treatments and general words with diseases (ailments). We train ATAM on a new collection of 1.6 million tweets discussing numerous health-related topics. ATAM isolates more coherent ailments, such as influenza, infections, and obesity, than standard topic models. Furthermore, ATAM matches influenza tracking results produced by Google Flu Trends and previous influenza-specialized Twitter models compared with government public health data.
You Are What You Tweet: Analyzing Twitter for Public Health
Michael Paul and Mark Dredze
International Conference on Weblogs and Social Media (ICWSM) – 2011
Abstract
Analyzing user messages in social media can measure different population characteristics, including public health measures. For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health-related tweets and discover mentions of over a dozen ailments, including allergies, obesity and insomnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over time (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.
Learning Sub-Word Units for Open Vocabulary Speech Recognition
Carolina Parada, Mark Dredze, Abhinav Sethy and Ariya Rastrow
Association for Computational Linguistics (ACL) – 2011
Abstract
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of sub-word units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the sub-word lexicon optimized for a given task. We consider the task of out-of-vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.
Training a Log-Linear Parser with Loss Functions via Softmax-Margin
Michael Auli and Adam Lopez
Proc. of EMNLP – 2011
Abstract
Log-linear parsing models are often trained by optimizing likelihood, but we would prefer to optimize for a task-specific metric like F-measure. Softmax-margin is a convex objective for such models that minimizes a bound on expected risk for a given loss function, but its naïve application requires the loss to decompose over the predicted structure, which is not true of F-measure. We use softmax-margin to optimize a log-linear CCG parser for a variety of loss functions, and demonstrate a novel dynamic programming algorithm that enables us to use it with F-measure, leading to substantial gains in accuracy on CCGBank. When we embed our loss-trained parser into a larger model that includes supertagging features incorporated via belief propagation, we obtain further improvements and achieve a labelled/unlabelled dependency F-measure of 89.3%/94.0% on gold part-of-speech tags, and 87.2%/92.8% on automatic part-of-speech tags, the best reported results for this task.
A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing
Michael Auli and Adam Lopez
Proc. of ACL – 2011
Abstract
Via an oracle experiment, we show that the upper bound on accuracy of a CCG parser is significantly lowered when its search space is pruned using a supertagger, though the supertagger also prunes many bad parses. Inspired by this analysis, we design a single model with both supertagging and parsing features, rather than separating them into distinct models chained together in a pipeline. To overcome the resulting increase in complexity, we experiment with both belief propagation and dual decomposition approaches to inference, the first empirical comparison of these algorithms that we are aware of on a structured natural language processing problem. On CCGbank we achieve a labelled dependency F-measure of 88.8% on gold POS tags, and 86.7% on automatic part-of-speech tags, the best reported results for this task.
Efficient CCG Parsing: A* versus Adaptive Supertagging
Michael Auli and Adam Lopez
Proc. of ACL – 2011
Abstract
We present a systematic comparison and combination of two orthogonal techniques for efficient parsing of Combinatory Categorial Grammar (CCG). First we consider adaptive supertagging, a widely used approximate search technique that prunes most lexical categories from the parser's search space using a separate sequence model. Next we consider several variants on A*, a classic exact search technique which to our knowledge has not been applied to more expressive grammar formalisms like CCG. In addition to standard hardware-independent measures of parser effort we also present what we believe is the first evaluation of A* parsing on the more realistic but more stringent metric of CPU time. By itself, A* substantially reduces parser effort as measured by the number of edges considered during parsing, but we show that for CCG this does not always correspond to improvements in CPU time over a CKY baseline. Combining A* with adaptive supertagging decreases CPU time by 15% for our best model.
The value of monolingual crowdsourcing in a real-world translation scenario: Simulation using Haitian Creole emergency SMS messages
Sanjeev Khudanpur, P. Resnik, Y. Kronrod, V. Eidelman, Olivia Buzek and B.B. Bederson
2011
News Personalization using Support Vector Machines
Anatole Gershman, Travis Wolfe, Eugene Fink and Jaime Carbonell
2011
Abstract
We describe a system for recommending news articles, called NewsPer, which learns news-reading preferences of its users and suggests recently published articles that may be of interest to specific readers based on their interest profiles. The underlying algorithm is based on representing articles by bags of words and named entities, and applying support vector machines to this representation. We present this algorithm and give initial empirical results. We also discuss broader issues in news personalization and the challenges of performance evaluation based on historical data.
Data-driven and feedback based spectro-temporal features for speech recognition
G.S.V.S. Sivaram, Sridhar Krishna Nemala, Nima Mesgarani and Hynek Hermansky
2011

Additional Information
Memorandum Report number ARL-MR-0798