Significant progress has been made in natural language processing (NLP) technologies in recent years, but most systems still fall short of human performance. Since many applications of these technologies require human-quality results, some form of manual intervention is necessary.
The success of such applications therefore depends heavily on the extent to which errors can be automatically detected and signaled to a human user. In our project we will attempt to devise a generic method for NLP error detection by studying the problem of Confidence Estimation (CE) in NLP results within a Machine Learning (ML) framework.
Although widely used in Automatic Speech Recognition (ASR) applications, this approach has not yet been extensively pursued in other areas of NLP. In ASR, error recovery is entirely based on confidence measures: results with a low level of confidence are rejected and the user is asked to repeat his or her statement. We argue that a large number of other NLP applications could benefit from such an approach. For instance, when post-editing MT output, a human translator could revise only those automatic translations that have a high probability of being wrong. Apart from improving user interactions, CE methods could also be used to improve the underlying technologies. For example, bootstrap learning could be based on outputs with a high confidence level, and NLP output re-scoring could depend on probabilities of correctness.
Our basic approach will be to use a statistical machine-learning framework to post-process NLP results: an additional ML layer will be trained to discriminate between correct and incorrect NLP outputs and to compute a confidence measure (CM), an estimate of the probability that an output is correct. We will test this approach on a statistical MT application using a very strong baseline MT system. Specifically, we will start from the same training corpus (Chinese-English data from recent NIST evaluations) and baseline system as the Syntax for Statistical Machine Translation team.
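The additional ML layer can be pictured as a binary classifier over features of each MT output. As a minimal sketch (not the project's actual model), the following trains a logistic-regression confidence layer on toy data; the two features per hypothesis are hypothetical stand-ins for the kinds of features discussed below:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_cm(features, labels, lr=0.5, epochs=200):
    """Train a logistic-regression confidence layer that maps a
    feature vector for one MT output to P(output is correct)."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def confidence(w, b, x):
    """Confidence measure: estimated probability of correctness."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Toy data: two hypothetical features per translation hypothesis
# (e.g. n-best agreement rate, normalized baseline model score),
# with labels 1 = correct output, 0 = incorrect output.
feats  = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
labels = [1, 1, 0, 0]
w, b = train_cm(feats, labels)
```

In the project itself the classifier family is an open question (Neural Nets and Support Vector Machines are among the candidates); logistic regression is used here only because it makes the "extra layer producing P(correct)" idea concrete in a few lines.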
During the workshop we will investigate a variety of confidence features and test their effects on the discriminative power of our CM using Receiver Operating Characteristic (ROC) curves. We will investigate features intended to capture the amount of overlap, or consensus, among the system’s n-best translation hypotheses; features focusing on the reliability of estimates from the training corpus; features intended to capture the inherent difficulty of the source sentence under translation; and features that exploit information from the base statistical MT system. Other themes for investigation include a comparison of different ML frameworks such as Neural Nets or Support Vector Machines, and a determination of the optimal granularity for confidence estimates (sentence-level, word-level, etc.).
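An ROC curve plots, for every possible rejection threshold on the CM, the true-positive rate against the false-positive rate; the area under the curve summarizes how well the CM separates correct from incorrect outputs. A small self-contained sketch of this evaluation (threshold sweep plus trapezoidal area), using assumed toy scores and labels:

```python
def roc_points(scores, labels):
    """(FPR, TPR) points of the ROC curve, obtained by sweeping
    the decision threshold down over the sorted confidence scores.
    labels: 1 = correct output, 0 = incorrect output."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for _score, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Toy example: a CM that ranks all correct outputs above all
# incorrect ones gives the maximal area of 1.0.
pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

A CM with no discriminative power traces the diagonal (area 0.5), so comparing areas across feature sets gives a direct measure of each feature's contribution.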
Two methods will be used to evaluate final results. First, we will perform a re-scoring experiment where the n-best translation alternatives output by the baseline system will be re-ordered according to their confidence estimates. The results will be measured using the standard automatic evaluation metric BLEU, and should be directly comparable to those obtained by the Syntax for Statistical Machine Translation team. We expect this to lead to many insights about the differences between our approach and theirs. Another method of evaluation will be to estimate the tradeoff between final translation quality and amount of human effort invested, in a simulated post-editing scenario.
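The re-scoring experiment can be sketched as follows. The interpolation scheme and weight below are assumptions for illustration (the source does not specify how confidence and baseline scores are combined); the hypotheses and confidence function are hypothetical toy values:

```python
import math

def rescore_nbest(nbest, confidence, lam=0.5):
    """Re-rank an n-best list by log-linearly combining the
    baseline model score with a confidence estimate.
    nbest: list of (hypothesis, baseline_log_score) pairs.
    confidence: maps a hypothesis to its estimated P(correct).
    lam: assumed interpolation weight between the two scores."""
    def combined(item):
        hyp, base_log_score = item
        return lam * base_log_score + (1 - lam) * math.log(confidence(hyp))
    return sorted(nbest, key=combined, reverse=True)

# Toy n-best list: the baseline prefers the first hypothesis,
# but the CM assigns it a much lower probability of correctness.
conf = {"good translation": 0.9, "bad translation": 0.2}.get
nbest = [("bad translation", -1.0), ("good translation", -1.2)]
reranked = rescore_nbest(nbest, conf)
```

The top hypothesis after re-ranking is then scored with BLEU against the references, which keeps the comparison with the baseline system's own 1-best output direct.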
Team members:
George Foster, University of Montreal
Simona Gandrabur, University of Montreal
Alberto Sanchis, University of Valencia
Nicola Ueffing, RWTH Aachen