Monolingual Distributional Similarity for Text-to-Text Generation
Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch
*SEM First Joint Conference on Lexical and Computational Semantics – 2012
Abstract: Previous work on paraphrase extraction and application has relied on either parallel datasets, or on distributional similarity metrics over large text corpora. Our approach combines these two orthogonal sources of information and directly integrates them into our paraphrasing system’s log-linear model. We compare different distributional similarity feature-sets and show significant improvements in grammaticality and meaning retention on the example text-to-text generation task of sentence compression, achieving state-of-the-art quality.
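The abstract above describes integrating distributional similarity features directly into a log-linear paraphrasing model. A minimal sketch of such scoring follows; the feature names and weights are invented for illustration and are not the paper's actual feature set.

```python
import math

def loglinear_score(features, weights):
    # generic log-linear model: score = exp(sum_i w_i * f_i)
    return math.exp(sum(weights.get(name, 0.0) * value
                        for name, value in features.items()))

# hypothetical feature vector for one paraphrase rule: a bilingual
# pivot probability plus a monolingual distributional-similarity score
features = {"log_p_pivot": math.log(0.5), "dist_sim": 0.8}
weights = {"log_p_pivot": 1.0, "dist_sim": 2.0}
score = loglinear_score(features, weights)
```

Because the model is log-linear, adding a new information source is just adding a feature and tuning its weight, which is how orthogonal monolingual evidence can be folded in.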
Abstract: With a few exceptions, extensions to latent Dirichlet allocation (LDA) have focused on the distribution over topics for each document. Much less attention has been given to the underlying structure of the topics themselves. As a result, most topic models generate topics independently from a single underlying distribution and require millions of parameters, in the form of multinomial distributions over the vocabulary. In this paper, we introduce the Shared Components Topic Model (SCTM), in which each topic is a normalized product of a smaller number of underlying component distributions. Our model learns these component distributions and the structure of how to combine subsets of them into topics. The SCTM represents topics far more compactly than LDA and achieves better perplexity with fewer parameters.
Space Efﬁciencies in Discourse Modeling via Conditional Random Sampling
Brian Kjersten and Benjamin Van Durme
2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies – 2012
Abstract: Recent exploratory efforts in discourse-level language modeling have relied heavily on calculating Pointwise Mutual Information (PMI), which involves significant computation when done over large collections. Prior work has required aggressive pruning or independence assumptions to compute scores on large collections. We show the method of Conditional Random Sampling, thus far an underutilized technique, to be a space-efficient means of representing the sufficient statistics in discourse that underlie recent PMI-based work. This is demonstrated in the context of inducing Schankian script-like structures over news articles.
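Conditional Random Sampling keeps, for each word, only the postings (document IDs) that hash smallest under a shared hash function; co-occurrence statistics for PMI can then be estimated from the overlap of two such sketches. A toy sketch of the idea, with invented parameters (the paper's actual estimators and sketch sizes will differ):

```python
import hashlib
import math

MAX_HASH = 2 ** 128  # md5 produces 128-bit values

def h(doc_id):
    # deterministic stand-in for a random permutation of document IDs
    return int(hashlib.md5(str(doc_id).encode()).hexdigest(), 16)

def make_sketch(postings, k):
    # keep only the k doc IDs with the smallest hashes: a conditional
    # random sample of this word's postings list
    return sorted(postings, key=h)[:k]

def estimate_pmi(sk_x, sk_y, count_x, count_y, n_docs):
    # cutoff below which BOTH sketches are complete random samples
    ds = min(max(h(d) for d in sk_x), max(h(d) for d in sk_y))
    sx = {d for d in sk_x if h(d) <= ds}
    sy = {d for d in sk_y if h(d) <= ds}
    joint = len(sx & sy)
    if joint == 0:
        return float("-inf")
    # scale the sampled joint count back up to the whole collection
    p_xy = (joint * MAX_HASH / ds) / n_docs
    p_x, p_y = count_x / n_docs, count_y / n_docs
    return math.log(p_xy / (p_x * p_y))

# toy collection: x and y always co-occur; z never co-occurs with x
n_docs = 1000
docs_x = list(range(200))
docs_y = list(range(200))
docs_z = list(range(500, 700))
k = 64
pmi_xy = estimate_pmi(make_sketch(docs_x, k), make_sketch(docs_y, k),
                      len(docs_x), len(docs_y), n_docs)
pmi_xz = estimate_pmi(make_sketch(docs_x, k), make_sketch(docs_z, k),
                      len(docs_x), len(docs_z), n_docs)
```

The space saving is that each word stores at most k IDs regardless of how long its postings list is, which is what makes PMI over large collections tractable without aggressive pruning.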
Judging Grammaticality with Count-Induced Tree Substitution Grammars
Francis Ferraro, Matt Post and Benjamin Van Durme
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP – 2012
Abstract: Prior work has shown the utility of syntactic tree fragments as features in judging the grammaticality of text. To date such fragments have been extracted from derivations of Bayesian-induced Tree Substitution Grammars (TSGs). Evaluating on discriminative coarse and fine grammaticality classification tasks, we show that a simple, deterministic, count-based approach to fragment identification performs on par with the more complicated grammars of Post (2011). This represents a significant reduction in complexity for those interested in the use of such fragments in the development of systems for the educational domain.
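A count-based fragment identification can be as simple as counting how often each fragment recurs across a treebank and keeping the frequent ones. The toy below restricts itself to depth-one fragments (plain CFG rules) for brevity; the paper's fragments can be larger, so this is only a simplified illustration of the counting idea.

```python
from collections import Counter

def fragments(tree):
    # enumerate depth-one fragments of a tree given as (label, children),
    # where children is either a word (leaf) or a list of subtrees
    label, children = tree
    if isinstance(children, str):
        return
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from fragments(child)

def count_induced_fragments(treebank, min_count=2):
    # deterministic, count-based identification: keep any fragment
    # that occurs at least min_count times in the treebank
    counts = Counter(f for tree in treebank for f in fragments(tree))
    return {f for f, n in counts.items() if n >= min_count}

t1 = ("S", [("NP", "dogs"), ("VP", "bark")])
t2 = ("S", [("NP", "cats"), ("VP", "meow")])
common = count_induced_fragments([t1, t2])
```

No sampling or Bayesian inference is involved, which is the complexity reduction the abstract highlights.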
Toward Tree Substitution Grammars with Latent Annotations
Francis Ferraro, Benjamin Van Durme and Matt Post
Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure – 2012
Abstract: We provide a model that extends the split-merge framework of Petrov et al. (2006) to jointly learn latent annotations and Tree Substitution Grammars (TSGs). We then conduct a variety of experiments with this model, first inducing grammars on a portion of the Penn Treebank and the Korean Treebank 2.0, and next experimenting with grammar refinement from a single nonterminal and from the Universal Part of Speech tagset. Qualitative analysis shows promising signs across all experiments that our combined approach successfully provides greater flexibility in grammar induction within the structured guidance of the treebank, leveraging the complementary natures of the two approaches.
Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
Juri Ganitkevitch, Chris Callison-Burch, Courtney Napoles and Benjamin Van Durme
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing – 2011
Abstract: Previous work has shown that high quality phrasal paraphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems.
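Bilingual paraphrase extraction pivots through foreign phrases: two English phrases that translate to the same foreign phrase are paraphrase candidates, scored as p(e2 | e1) = Σ_f p(e2 | f) · p(f | e1). A minimal sketch with toy phrase tables (the German pivots and all probabilities here are invented for illustration):

```python
from collections import defaultdict

def pivot_paraphrases(p_f_given_e, p_e_given_f, phrase):
    # pivot through foreign phrases f:
    #   p(e2 | e1) = sum_f p(e2 | f) * p(f | e1)
    scores = defaultdict(float)
    for f, p_f in p_f_given_e.get(phrase, {}).items():
        for e2, p_e2 in p_e_given_f.get(f, {}).items():
            if e2 != phrase:
                scores[e2] += p_e2 * p_f
    return dict(scores)

# toy phrase tables (probabilities invented for illustration)
p_f_e = {"thrilled": {"begeistert": 0.7, "erfreut": 0.3}}
p_e_f = {"begeistert": {"thrilled": 0.6, "delighted": 0.4},
         "erfreut": {"delighted": 0.5, "pleased": 0.5}}
paras = pivot_paraphrases(p_f_e, p_e_f, "thrilled")
```

Extending this from flat phrases to synchronous syntactic rules is what lets the paper capture transformations like passivization rather than only lexical substitutions.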
Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images
Shane Bergsma and Benjamin Van Durme
Proc. IJCAI – 2011
Reranking Bilingually Extracted Paraphrases Using Monolingual Distributional Similarity
Charley Chan, Chris Callison-Burch and Benjamin Van Durme
Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics – 2011
Abstract: This paper improves an existing bilingual paraphrase extraction technique using monolingual distributional similarity to rerank candidate paraphrases. Raw monolingual data provides a complementary and orthogonal source of information that lessens the commonly observed errors in bilingual pivot-based methods. Our experiments reveal that monolingual scoring of bilingually extracted paraphrases correlates significantly more strongly with human judgments of grammaticality than do the probabilities assigned by the bilingual pivoting method. The results also show that monolingual distributional similarity can serve as a threshold for high-precision paraphrase selection.
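Distributional similarity compares the monolingual contexts in which two phrases occur: good paraphrases tend to appear in similar contexts. A minimal sketch of context-vector reranking, assuming single-word candidates and a tiny tokenized corpus (real systems use large corpora and richer context features):

```python
import math
from collections import Counter

def context_vector(sentences, word, window=2):
    # distributional signature: counts of words within +/- window tokens
    vec = Counter()
    for toks in sentences:
        for i, t in enumerate(toks):
            if t == word:
                lo = max(0, i - window)
                vec.update(toks[lo:i] + toks[i + 1:i + 1 + window])
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

def rerank(candidates, original, sentences):
    # order bilingually extracted candidates by monolingual similarity
    base = context_vector(sentences, original)
    return sorted(candidates,
                  key=lambda c: cosine(base, context_vector(sentences, c)),
                  reverse=True)

corpus = [["the", "quick", "car", "ran"],
          ["the", "fast", "car", "ran"],
          ["a", "ripe", "banana", "fell"]]
ranked = rerank(["banana", "fast"], "quick", corpus)
```

The monolingual score is orthogonal to the pivot probability, so a similarity threshold can filter out pivot errors such as spurious candidates from noisy alignments.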
WikiTopics: What is Popular on Wikipedia and Why
Byung Gyu Ahn, Benjamin Van Durme and Chris Callison-Burch
Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages – 2011
Abstract: We establish a novel task in the spirit of news summarization and topic detection and tracking (TDT): daily determination of the topics newly popular with Wikipedia readers. Central to this effort is a new public dataset consisting of the hourly page view statistics of all Wikipedia articles over the last three years. We give baseline results for the tasks of: discovering individual pages of interest, clustering these pages into coherent topics, and extracting the most relevant summarizing sentence for the reader. When compared to human judgements, our system shows the viability of this task, and opens the door to a range of exciting future work.
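Discovering pages of interest from hourly view statistics amounts to spike detection over time series. The z-score baseline below is not the paper's method, only an illustrative sketch of the task on invented data:

```python
import statistics

def newly_popular(history, today, z_threshold=3.0):
    # flag articles whose page views today spike well above their
    # historical hourly mean (a simple z-score baseline)
    spikes = []
    for page, views in history.items():
        mu = statistics.mean(views)
        sd = statistics.pstdev(views) or 1.0  # avoid division by zero
        if (today[page] - mu) / sd > z_threshold:
            spikes.append(page)
    return spikes

# toy data: one article suddenly spikes, one stays steady
history = {"Eyjafjallajokull": [5] * 24, "Water": [100] * 24}
today = {"Eyjafjallajokull": 5000, "Water": 101}
hot = newly_popular(history, today)
```

The flagged pages would then feed the clustering and sentence-extraction stages described in the abstract.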
Evaluating Sentence Compression: Pitfalls and Suggested Remedies
Courtney Napoles, Benjamin Van Durme and Chris Callison-Burch
Proceedings of the Workshop on Monolingual Text-To-Text Generation – 2011
Abstract: This work surveys existing evaluation methodologies for the task of sentence compression, identifies their shortcomings, and proposes alternatives. In particular, we examine the problems of evaluating paraphrastic compression and comparing the output of different models. We demonstrate that compression rate is a strong predictor of compression quality and that perceived improvement over other models is often a side effect of producing longer output.
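The confound the abstract identifies is easy to state concretely: compression rate is the fraction of the original retained, and systems should only be compared at similar rates. A minimal sketch (the unit choice is an assumption; evaluations may measure in characters or words):

```python
def compression_rate(original, compressed, unit="char"):
    # fraction of the original retained; lower means heavier compression
    measure = len if unit == "char" else (lambda s: len(s.split()))
    return measure(compressed) / measure(original)

orig = "the quick brown fox jumped over the lazy dog"
comp = "the fox jumped over the dog"
cr_char = compression_rate(orig, comp)
cr_word = compression_rate(orig, comp, unit="word")
```

A system that compresses less (higher rate) will tend to score better on meaning retention for reasons unrelated to model quality, which is why matched-rate comparison matters.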
Paraphrastic Sentence Compression with a Character-based Metric: Tightening without Deletion
Courtney Napoles, Chris Callison-Burch, Juri Ganitkevitch and Benjamin Van Durme
Proceedings of the Workshop on Monolingual Text-To-Text Generation – 2011
Abstract: We present a substitution-only approach to sentence compression which “tightens” a sentence by reducing its character length. Replacing phrases with shorter paraphrases yields paraphrastic compressions as short as 60% of the original length. In support of this task, we introduce a novel technique for re-ranking paraphrases extracted from bilingual corpora. At high compression rates, paraphrastic compressions outperform a state-of-the-art deletion model in an oracle experiment. For further compression, deleting from oracle paraphrastic compressions preserves more meaning than deletion alone. In either setting, paraphrastic compression shows promise for surpassing deletion-only methods.
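Substitution-only tightening replaces phrases with shorter paraphrases rather than deleting words. The greedy toy below, with an invented paraphrase table, illustrates the character-length objective; the paper's system scores substitutions with a full model rather than taking the shortest option blindly.

```python
def tighten(sentence, paraphrase_table):
    # substitution-only compression: replace each phrase with its
    # shortest listed paraphrase whenever that shortens the string
    out = sentence
    for phrase, alternatives in paraphrase_table.items():
        best = min(alternatives, key=len)
        if len(best) < len(phrase) and phrase in out:
            out = out.replace(phrase, best)
    return out

# hypothetical paraphrase table for illustration
table = {"in the event that": ["if", "should"],
         "at this point in time": ["now", "currently"]}
tight = tighten("in the event that it rains, stay indoors", table)
```

Because nothing is deleted, every content word of the original is still accounted for in the compression, which is why meaning retention can exceed deletion-only models at the same rate.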
Shared Components Topic Models with Application to Selectional Preference
Matt Gormley, Mark Dredze, Benjamin Van Durme and Jason Eisner
NIPS 2011 Workshop on Learning Semantics – 2011
Abstract: Latent Dirichlet Allocation (LDA) has been used to learn selectional preferences as soft disjunctions over flat semantic classes. Our model, the SCTM, also learns the structure of each class as a soft conjunction of high-level semantic features.
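The SCTM's core construction (stated in the longer abstract above) is that a topic is a normalized product of a selected subset of shared component distributions. A minimal numeric sketch of that composition, with an invented four-word vocabulary and component values:

```python
def sctm_topic(components, selector):
    # topic = normalized elementwise product of the component
    # distributions switched on by the binary selector vector
    vocab_size = len(components[0])
    prod = [1.0] * vocab_size
    for comp, on in zip(components, selector):
        if on:
            prod = [p * c for p, c in zip(prod, comp)]
    z = sum(prod)
    return [p / z for p in prod]

# two overlapping components over a four-word toy vocabulary;
# the product sharpens mass onto words both components favor
animals = [0.4, 0.4, 0.1, 0.1]
pets    = [0.4, 0.1, 0.4, 0.1]
topic = sctm_topic([animals, pets], [1, 1])
```

Because many topics reuse the same components with different selectors, the model needs far fewer multinomial parameters than one independent distribution per topic.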
Efficient Spoken Term Discovery Using Randomized Algorithms
Aren Jansen and Benjamin Van Durme
ASRU – 2011
Streaming Pointwise Mutual Information
Benjamin Van Durme and Ashwin Lall
Advances in Neural Information Processing Systems 22 – 2009