Phones, Anti-Phones, and Diphones: A Probabilistic Framework for Feature-Based Speech Recognition – Jim Glass (MIT)
Abstract
In most current speech recognizers, the observation space of an utterance consists of a temporal sequence of “frames”. An important property of this framework is that every segmentation of the input utterance accounts for all of the observations. In contrast, in a “feature”-based framework built on segments (either implicit or explicit), each segment is represented by a fixed-dimensional feature vector, so that alternative segmentations of the utterance consist of different observations.
In this work, we have developed a probabilistic framework that allows us to compare different paths by considering the entire observation space of features. The approach we have adopted is to add an extra lexical unit, defined to map to all segments that do not correspond to one of the existing units. In our phonetic-based modeling, we call this unit the anti-phone, and use it to model all sounds that are not a phonetic unit because they are too large, too small, overlapping, etc. Two competing paths must therefore account for all segments, either as a normal acoustic-phonetic unit or as the anti-phone. It can be shown that this approach can be implemented while considering only the segments in a particular segmentation. This is done by normalizing the phonetic likelihood of a particular segment by the likelihood of the anti-phone for that same segment.
In this talk, I will describe the feature-based framework we have developed, show how it is currently being used in the MIT SUMMIT speech recognition system, and discuss several phonetic recognition experiments using the TIMIT acoustic-phonetic corpus, where we are able to achieve context-independent and context-dependent phonetic recognition accuracies of 64.5% and 69.5%, respectively.
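The anti-phone normalization described in the abstract can be illustrated with a small sketch. It is not the SUMMIT implementation: the segment boundaries, the log-likelihood values, and the function names (full_score, normalized_score) are all hypothetical, chosen only to show the idea. Under the assumption that segments are scored independently, scoring every candidate segment (on-path segments with their phonetic unit, all others with the anti-phone) differs from the sum of per-segment log-likelihood ratios along the path only by a constant that is the same for every path, so the two scores rank competing paths identically.

```python
import math

# Hypothetical candidate segments over a 3-frame utterance, as (start, end).
# anti_ll[seg]  : log-likelihood of the segment under the anti-phone model.
# phone_ll[seg] : log-likelihood of the segment under the phonetic unit
#                 hypothesized for it. All values are made up for illustration.
anti_ll = {(0, 1): -4.0, (1, 2): -3.5, (2, 3): -4.2,
           (0, 2): -3.0, (1, 3): -3.8, (0, 3): -2.5}
phone_ll = {(0, 1): -2.0, (1, 2): -2.5, (2, 3): -1.8,
            (0, 2): -3.6, (1, 3): -4.1, (0, 3): -6.0}

# Competing segmentations (paths) that each span the whole utterance.
paths = [
    [(0, 1), (1, 2), (2, 3)],
    [(0, 2), (2, 3)],
    [(0, 1), (1, 3)],
    [(0, 3)],
]

def full_score(path):
    """Score ALL candidate segments: on-path segments with their phonetic
    unit, every off-path segment with the anti-phone."""
    on_path = set(path)
    phones = sum(phone_ll[seg] for seg in on_path)
    antis = sum(ll for seg, ll in anti_ll.items() if seg not in on_path)
    return phones + antis

def normalized_score(path):
    """Score only the segments on the path, each normalized by the
    anti-phone likelihood for that same segment (a log-likelihood ratio)."""
    return sum(phone_ll[seg] - anti_ll[seg] for seg in path)

# The two scores differ by the same constant for every path:
# the total anti-phone log-likelihood over all candidate segments.
constant = sum(anti_ll.values())

for path in paths:
    print(path, round(full_score(path), 3), round(normalized_score(path), 3))
```

Because the constant does not depend on the path, a search can use normalized_score and touch only the segments in the segmentation under consideration, which is the property the abstract refers to.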