Complementary Evaluation Measures for Speech Transcription
The classical performance measure of a speech recognizer, a.k.a. speech transcriber, has been word-error rate (WER). This measure dates back to the days when ASR was regarded as a task in its own right, and the goal was nothing more than to produce the same perfect transcriptions that humans can (in principle) generate from listening to speech. We know better now: it seems, in fact, very unlikely that anyone would be interested in reading as much as a single-page transcript of colloquial, spontaneous speech, even if it were transcribed perfectly. What people want is to search through speech, summarize speech, translate speech, etc. And our computers now have the memory capacity to store large amounts of digitized audio, which makes these derivative tasks more direct.
Under the hood of any one of these "real" tasks, however, is a speech recognizer, or at least some components of one, which generates word hypotheses that are numerically scored. Even very flawed transcripts are a very valuable source of features on which to train spoken language processing applications, even if we would be too embarrassed to show them to anyone. How do we evaluate the output of these components? According to recent HCI research, WER simply does not work. Dumouchel showed that manually correcting a transcript with more than 20% WER is actually harder than starting over from scratch, and yet Munteanu et al. showed that, on a common lecture-browsing task for university students, transcripts with WERs of as much as 25% are statistically significantly better than having no transcript at all. Transcripts with WERs as bad as 46% have proven to be a useful source of features for speech summarization systems, at least according to the very flawed standards of current summarization evaluations. It is also clear, however, that those same standards often rate as very good summaries that lack the higher-level organization or goal orientation that people expect from summaries, and the extent to which WER affects this remains unclear. The University of Toronto's computational linguistics lab, which specializes in HCI-style experimental design for spoken language interfaces, is currently conducting a large human-subject study of speech summarizers, in order to evaluate summary quality in a more ecologically valid fashion.
With this experience at hand, this proposed workshop would focus on measures of transcript quality that are complementary to WER in the ASR- and summarization-related tasks of: (1) rapid message assessment, in which an accurate gist of a small spoken message must be formed in order to make a rapid decision, such as in military and emergency rescue contexts; and (2) decision-support for meetings, in which very interactive spoken negotiations between multiple participants must be distilled into a set of promises, deliverables and dates. Our intention is to bring our experience with human-subject experimentation on ASR applications together with recent advances in semantic distance measures as well as statistical parsing to formulate complementary objective functions to WER that can be computed without human-subject trials and employed to turn around better message-assessment and decision-support systems through periodic testing on development data.
Recent work on alternative metrics for statistical machine translation could be cited as parallel to the present proposal, but there is an important distinction. The recent SMT metrics work has focussed on a principle that, in HCI research, is called _construct validity_. BLEU scores have no construct validity, it is claimed, because an SMT development team can "game" an evaluation by seeking better BLEU scores even if it adversely affects the true quality of the translation (whatever that is). Our proposal seeks to remedy a more fundamental problem with speech transcription evaluation, one that concerns what is sometimes called _ecological validity_. Even stipulating for the moment that WER is the perfect construct for measuring success in speech transcription, the current value of ASR systems does not derive from their success at generating transcriptions at any level of quality. Not only do we have the wrong measure, but we're also measuring the wrong task. Rapid message assessment and decision-support for meetings are ecologically valid tasks. (In case you were wondering, there has been no serious consideration of how ecologically valid our evaluation of automatically translated documents is either, but the recent work on MT metrics has addressed more acutely perceived concerns about the construct invalidity of BLEU scores.)
The current format of JHU workshops makes it extremely difficult to conduct any sort of corroborative human-subject trial during the actual six-week workshop itself, so we anticipate that during the six-month period prior to the event, workshop participants would pool speech corpora in the two above-mentioned domains and run them through a few state-of-the-art ASR systems in order to collect transcripts of varying WERs. Again prior to the workshop, we would then conduct the human-subject trials necessary to establish an ecologically valid gold standard for human performance on these tasks. Among other purposes, this gold standard would serve to differentiate transcripts that have roughly the same WER but differ greatly in how well they enable the task at hand. The six-week period of the workshop would then be devoted to experimentation with objective functions based on NLP techniques, as well as improvement of an existing system in at least one of these two tasks, to demonstrate the benefit of this alternative evaluation scheme on an actual piece of spoken language technology.
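To make the point about transcripts of equal WER concrete, here is a minimal sketch, assuming the standard definition of WER as word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical illustrations, not material from the proposed corpora or from any particular scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: real scoring tools also normalize text
    (punctuation, numerals, hesitations), which is skipped here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical meeting utterance containing a deliverable and a date.
reference = "send the final report to the client by friday"
# Two errors on function words: the commitment is still recoverable.
hyp_a = "send a final report to a client by friday"
# Two errors on content words: the deliverable and the date are lost.
hyp_b = "send the final resort to the client by sunday"

wer_a = word_error_rate(reference, hyp_a)
wer_b = word_error_rate(reference, hyp_b)
print(f"WER A = {wer_a:.3f}, WER B = {wer_b:.3f}")
```

Both hypotheses contain two substitutions against a nine-word reference, so both score 2/9, yet only the first preserves what a decision-support system would need to extract. A complementary measure of the kind proposed here would be expected to separate the two.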
|Benoit Favre||Université de Marseille|
|Gerald Penn||University of Toronto|
|Stephen Tratz||Department of Defense|
|Clare Voss||Department of Defense|
|Siavash Kazemian||University of Toronto|
|Adam Lee||City University of New York|
|Kyla Cheung||Columbia University|
|Dennis Ochei||Duke University|
|Yang Liu||University of Texas at Dallas|
|Cosmin Munteanu||National Research Council Canada|
|Ani Nenkova||University of Pennsylvania|
|Frauke Zeller||Wilfrid Laurier University|