P. Beyerlein (1), W. Byrne (2), J. M. Huerta (3), S. Khudanpur (2), B. Marthi (4),
J. Morgan (5), N. Peterek (6), J. Picone (7), W. Wang (8)
Presented at the 1999 IEEE Workshop on Automatic Speech
Recognition and Understanding
Philips Research Laboratories (1) CLSP, Johns Hopkins University (2)
Dept. ECE, Carnegie Mellon University (3) Depts. CS and Math, University of Toronto (4)
Dept. Foreign Languages, USAMA, West Point (5) UFAL, Charles University, Prague (6)
ISIP, Mississippi State University (7) Dept. ECE, Rice University (8)
Language independent acoustic modeling was one of the topics studied at the 1999 Johns Hopkins University Language Engineering Workshop hosted by the Center for Language and Speech Processing. Our work was motivated by the need for speech recognition in languages beyond the well-studied languages of Europe, Asia, and the Americas. The statistical techniques used for speech and language modeling require relatively large amounts of monolingual speech and text as training data. In the `resource-rich' languages which have such corpora, these statistical methods have been shown to work quite well. However, if only small amounts of training data are available in a language, these monolingual techniques are less effective. Our goal was to address this problem by developing techniques that reduce the amount of data needed to model resource-poor languages by borrowing data and models from resource-rich languages.
While in our studies we used multiple languages simultaneously, our goal was not to build a `multilingual' ASR system capable of recognizing several languages equally well. We intended instead to develop a good monolingual system for a specified target language by borrowing data and models from other languages. This is called `language independent acoustic modeling' to suggest a similarity in nature to speaker independent modeling. In the current state-of-the-art, speaker independent models are first trained from multiple speakers and then adapted to a specific speaker either before or during recognition. Analogously, language independent modeling is a methodology that combines speech and models from multiple source languages and transforms them for recognition in a specific target language.
As mentioned above, acoustic training data is only one of the resources needed for statistical ASR. We have assumed that language models, pronunciations, and appropriate acoustic processing are available for the target language, and that only transcribed acoustic training data is in short supply. This is not a completely unrealistic scenario, in that dictionaries with pronunciations are available for many languages, as are on-line newspapers and other text. Nevertheless, we stress that we address here only one aspect of language independent modeling.
We have developed methods to share data and acoustic models between languages. Underlying these methods are `phone mappings' that describe the similarity of sounds in two different languages. We obtain these phone mappings using both knowledge-based and automatic methods. The knowledge-based methods rely only on acoustic-phonetic categorizations of the individual languages and as such can be used if no data at all is available in the target language. The automatic methods derive phone mappings using small amounts of acoustic data in the target language. By either approach we can borrow models from several languages simultaneously to cover the phone inventory of the target language. The automatic methods allow additional refinement by borrowing models sub-phonetically at the HMM-state level. This can be especially valuable if the target language contains phones not found in any of the source languages, since these techniques are free to assemble a new phone model from component states of different source language phone models.
While both the automatic and knowledge-based phone mappings can be used without modification to construct recognizers in the target language by borrowing acoustic models from the various source languages, HMM adaptation techniques can also be used to improve the systems using the small amount of target language adaptation data we assume is available. As a further refinement, we obtained the best recognition performance not from individually adapted source language acoustic models but by using Discriminative Model Combination (DMC) to combine models from several languages simultaneously. This combination can be done at the sentence or sub-word level, with better performance obtained using phone-level combinations. We note in particular that DMC makes effective use of source language acoustic models that by themselves do not perform well in transcribing the target language.
We present below a necessarily brief description of our experiments. Our web site www.clsp.jhu.edu/ws99/projects/asr contains complete documentation of our work, some of the language data and models used, and a more extensive bibliography of prior work in language independent and multilingual acoustic modeling.
We chose Czech language Voice of America (VOA) broadcasts as our test domain since news broadcasts contain a variety of different types of speech and are relatively easy to obtain. We chose Czech since we have ongoing projects [2] from which we could borrow resources. We also felt that studying Czech as a rapid-porting task was realistic since, unlike Spanish or Mandarin, there is fairly little knowledge of existing Czech ASR to influence our work. Our final test set consisted of one week of news broadcasts, although due to evolution of our experiments, not all the numbers reported below are directly comparable; see our web site for more detailed reporting.
As our out-of-domain acoustic training data, we used broadcast news recordings in English, Spanish, and Mandarin obtained from the Linguistic Data Consortium. We also used read Russian speech collected at West Point for computer aided foreign language instruction and read Czech speech from the Charles University Corpus of Financial News (CUCFN). All speech was down-sampled to 16 kHz as needed. The acoustic models were trained from mel-frequency cepstral data using HTK [6]. Unless otherwise noted, the source language acoustic models were monophone systems to simplify cross-language mapping; full system descriptions are on our web site.
We built our initial Czech broadcast news system from a ten hour Czech VOA acoustic training set using techniques known to work well in other languages and domains. The language model and pronouncing dictionary were taken from our previous work [2]. After obtaining the performance of this well-trained system, we reduced drastically the size of the acoustic training set and retrained new, impoverished acoustic models. Given our past experience and the reported experience of others, we expected that training a system using approximately one hour of acoustic training data would yield an ASR system that performed substantially worse than the initial, well-trained 10 hour system. We would then attempt to improve this impoverished system by borrowing from other languages. However, as Table 1 shows, performance on Czech VOA is relatively good despite large variations in training set size and model complexity. This behavior appears to be due to the extremely regular and careful speech used by Czech VOA announcers and not due to a preponderance of speech by individual news anchors or other obvious similarities between training and test sets. We note that we observed similar behavior in experiments with Spanish VOA broadcasts.
| Training Data | Model type | WER (%) |
| 12.8 hour | 12 mixture, cross-word triphone | 27.1 |
| 10.0 hour | 20 mixture, monophone | 27.6 |
| 1.0 hour | 8 mixture, monophone | 30.2 |
| 0.5 hour | 20 mixture, monophone | 31.3 |
From these results we concluded that the Czech VOA speech was too self-similar to be used as both training and test data. We therefore investigated a cross-domain training scenario in which a small amount of read speech from the CUCFN corpus would serve as the Czech language training data. After comparing performance across the mono-lingual Czech read and broadcast domains (Table 2), we decided to fix the 1.0 hour CUCFN read speech training set as the Czech language acoustic training set and to attempt to improve performance on the Czech VOA test data by borrowing from English, Mandarin, Spanish and Russian. This provides a realistic and interesting training scenario that involves cross-domain as well as multilingual factors.
| Training Set | CUCFN | VOA |
| 1.0 hr VOA | 66.1% | 28.8% |
| 1.0 hr CUCFN | 47.3% | 35.7% |
These experiments with Czech VOA are reported as a cautionary note to emphasize that language is just one characteristic of speech and that other conditions, such as speaking style, are significant factors in ASR performance. It is therefore critically important to obtain diverse training and test sets for multilingual experiments. It is also important that results of limited domain experiments, such as training and testing with data from the same news programs, be interpreted cautiously since performance may not carry over to more diverse domains.
Our initial experiments involved simple mappings in which phones from the Czech target language were mapped to their nearest neighbor in a single source language using a similarity measure based on feature-based descriptions of the phones. This is a manual procedure that leverages extensive knowledge of acoustic phonetics [3]. Our approach involved first describing the phones in both the source and target languages in terms of their articulatory positions, a process that leads to a description of the sounds using the International Phonetic Alphabet (IPA) [4].
The advantage of this approach is that all languages can, in theory, be represented within the same system. We determined the proximity of a sound in the target language to a sound in the source language using this representation, and developed an associated symbol-to-symbol mapping. While it was possible to achieve reasonable mappings for each language, there are significant variations in the level of detail used in the source language phonetic inventories. Spanish, for example, used only 25 phones, while Russian used 44 phones. We used these mappings to obtain baseline performance with acoustic models borrowed from the source languages. The procedure was quite simple: represent each phone symbol in the Czech lexicon using the corresponding source language phone given by these mappings. The performance of systems constructed in this manner is given in Table 3. Overall, we observe that performance is poor, in the range of 80% WER. It was a great surprise to observe that the Russian acoustic models, though they were trained on read speech, were a close match to the VOA data, especially considering the differences in microphones, speaking style, and speaking rates. We also observed from these experiments that performance for English and Spanish was comparable, and that performance for Mandarin lags the other systems.
| Source Language | Czech VOA WER (%) |
| Russian | 60.8 |
| Spanish | 71.7 |
| English | 75.5 |
| Mandarin | 88.7 |
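The nearest-neighbor mapping and lexicon rewriting described above can be sketched in a few lines of Python. The workshop procedure was manual; this sketch only automates the same idea. The phone symbols, the feature sets, and the use of Jaccard overlap as the similarity measure are all assumptions made for illustration, not the inventories or measure used in the experiments.

```python
# Hypothetical articulatory feature descriptions; the feature names and phone
# symbols are illustrative and are not the workshop's actual inventories.
TARGET_PHONES = {
    "cz_a": {"vowel", "open", "central", "unrounded"},
    "cz_t": {"consonant", "plosive", "alveolar", "voiceless"},
    "cz_rz": {"consonant", "trill", "alveolar", "voiced", "fricated"},
}
SOURCE_PHONES = {
    "sp_a": {"vowel", "open", "central", "unrounded"},
    "sp_t": {"consonant", "plosive", "dental", "voiceless"},
    "sp_r": {"consonant", "trill", "alveolar", "voiced"},
    "sp_s": {"consonant", "fricative", "alveolar", "voiceless", "sibilant"},
}


def similarity(a, b):
    """Jaccard overlap between two feature sets (an assumed similarity measure)."""
    return len(a & b) / len(a | b)


def nearest_neighbor_mapping(target, source):
    """Map every target phone to the source phone with the most similar features."""
    return {
        t: max(sorted(source), key=lambda s: similarity(feats, source[s]))
        for t, feats in target.items()
    }


def map_pronunciation(pron, mapping):
    """Rewrite a target-language pronunciation in source-language phone symbols."""
    return [mapping[p] for p in pron]


mapping = nearest_neighbor_mapping(TARGET_PHONES, SOURCE_PHONES)
print(mapping)
print(map_pronunciation(["cz_t", "cz_a", "cz_rz"], mapping))
```

In this toy inventory the Czech phone with no exact counterpart (`cz_rz`) falls back to the closest available source phone, which is the situation that motivates borrowing from several source languages at once.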
It was evident from the construction of the mappings that a single source language did not provide optimal coverage of Czech. Therefore, it was natural to explore a mapping that involved phones from all source languages based on proximity in the IPA table. Since Russian was clearly acoustically closer to Czech than any of the other source languages, we excluded Russian from the set of source languages for this experiment, so that it would not mask any trends in our knowledge-based systems. Though we achieved modest improvements in performance (1.6% absolute WER), we did not achieve performance comparable to data-driven mapping methods discussed next.
Our next attempt to understand deficiencies in the knowledge-based system was to explore a series of experiments in which the recognition system was allowed to choose the best combination of phones at runtime. First, we explored a parallel pronunciation approach [5] in which each item in the lexicon was represented as a sequence of phones from a single language implemented using pronunciation networks. Unfortunately, this approach resulted in slightly degraded performance even though we had hoped that the additional degrees of freedom would offset any systematic acoustic bias between the two domains. We next tried a multiphone approach that allowed the recognition system to mix and match phones from all source languages as an attempt to let the recognizer find the best realization of a phone, rather than fixing this based on a priori linguistic knowledge. We found minor improvement in performance over the parallel pronunciation system, as expected. However, overall performance is still below the best monolingual system, and far below the Russian monolingual system. In these experiments we have observed that, though the overall WER is high, performance at the phone level appears to be quite good. The alignments are plausible, and a majority of the words are only partially misrecognized. Since Czech is an inflected language, this analysis raised some concerns that our language modeling approach was not optimal. For example, a morphologically-based approach might be better if the majority of the errors occur on endings rather than stems; it could be the case that performance at a morphological level is good, and hence the system would be usable for information extraction tasks.
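The difference between the parallel pronunciation and multiphone lexicons described above can be illustrated with a small sketch. The phone symbols and the two per-language mappings are hypothetical, and the pronunciation network is represented simply as a list of alternatives per phone position, which is an assumption about representation rather than a description of the decoder used in the experiments.

```python
from itertools import product

# Hypothetical per-language phone mappings for a two-phone Czech word.
MAPPINGS = {
    "en": {"cz_t": "en_t", "cz_a": "en_ah"},
    "sp": {"cz_t": "sp_t", "cz_a": "sp_a"},
}


def parallel_pronunciations(pron, mappings):
    """One complete pronunciation per source language (no mixing within a word)."""
    return [[mappings[lang][p] for p in pron] for lang in sorted(mappings)]


def multiphone_network(pron, mappings):
    """For each phone position, the alternatives offered by all source languages."""
    return [[mappings[lang][p] for lang in sorted(mappings)] for p in pron]


word = ["cz_t", "cz_a"]
parallel = parallel_pronunciations(word, MAPPINGS)
network = multiphone_network(word, MAPPINGS)
paths = list(product(*network))  # every path the recognizer may choose

print(parallel)
print(len(paths))
```

With two source languages and a two-phone word, the parallel lexicon offers two pronunciations while the multiphone network admits four, including mixed paths such as an English first phone followed by a Spanish second phone.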
We developed a general methodology to derive cross-language mappings automatically both at phonetic and sub-phonetic levels. We call our approach the Confusion Matrix approach to finding cross-lingual mappings. These confusion matrices are tables of acoustic similarity between phones across languages. They are obtained by first performing a mono-lingual phonetic labeling of the target language acoustic data using the target language phone set - this can be done manually or via forced-alignment using HMMs; we use the latter approach. Phonetic recognition of this data is then performed using acoustic models from each of the source languages; for this we used simple, unweighted, phone-loop recognizers. This yields multiple phonetic segmentations of the target language acoustic data in the source language phone inventories.
Once a criterion for co-occurrence between two phonetic labelings of the acoustic segments is defined (e.g., a minimum number of overlapping frames, etc.), we can arrange the phones of the source language and target language into a matrix that contains the counts of co-occurrences between the nth and kth phones of the source and target languages, respectively, in the (n,k) entry of the matrix. This matrix of co-occurrences is the confusion matrix.
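As a minimal sketch of the counting step above, assume each labeling is a list of `(start_frame, end_frame, phone)` segments and that two segments co-occur when they share at least `min_overlap` frames. The segment format, the threshold value, and the phone symbols are assumptions for illustration only.

```python
from collections import Counter


def overlap(seg_a, seg_b):
    """Number of frames shared by two (start, end, label) segments, end exclusive."""
    return max(0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))


def confusion_counts(source_segs, target_segs, min_overlap=3):
    """Count co-occurrences of (source phone, target phone) pairs.

    A pair co-occurs when its segments share at least `min_overlap` frames;
    the threshold value is an assumption made for this sketch.
    """
    counts = Counter()
    for s in source_segs:
        for t in target_segs:
            if overlap(s, t) >= min_overlap:
                counts[(s[2], t[2])] += 1
    return counts


# Hypothetical labelings of one utterance, in frames.
target_segs = [(0, 10, "cz_a"), (10, 18, "cz_t"), (18, 30, "cz_a")]
source_segs = [(0, 9, "en_ah"), (9, 12, "en_d"), (12, 19, "en_t"), (19, 30, "en_ah")]

counts = confusion_counts(source_segs, target_segs)
for (src, tgt), c in sorted(counts.items()):
    print(src, tgt, c)
```

The short `en_d` segment never overlaps a Czech segment by three frames, so it contributes nothing; accumulating such counts over all the target language data gives the entries of the confusion matrix.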
After the confusion matrix between the phones of two languages is obtained, we derive mappings from this matrix. Given a source phone (in the nth row), we would like to select the phone in the target language that best matches it (i.e., choose the best matching kth column). To do this we can simply choose the column with the highest count. A better method takes into account the number of times the kth source language phone was hypothesized by dividing the counts of the bin (n,k) by the accumulated counts of the column k.
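The two selection rules above, highest raw count versus count divided by the column total, can be compared on a toy matrix. This sketch follows the row and column mechanics of the paragraph literally: each row is a phone to be mapped, each column is a candidate, and normalization divides by the column total. Which language indexes the rows is left to the caller; the labels and counts below are hypothetical, and treating the columns as the hypothesized candidate phones is an assumption.

```python
def column_totals(counts):
    """Accumulated counts of each column of a list-of-rows matrix."""
    return [sum(row[k] for row in counts) for k in range(len(counts[0]))]


def derive_mapping(counts, row_labels, col_labels, normalize=True):
    """For each row, pick the best column, optionally after column normalization."""
    totals = column_totals(counts)
    mapping = {}
    for n, row in enumerate(counts):
        scores = [
            (c / totals[k] if normalize and totals[k] else c)
            for k, c in enumerate(row)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        mapping[row_labels[n]] = col_labels[best]
    return mapping


# Hypothetical counts; the first candidate is hypothesized far more often
# than the others and therefore dominates the raw counts.
ROWS = ["cz_a", "cz_t", "cz_d"]
COLS = ["en_ah", "en_t", "en_d"]
COUNTS = [
    [90, 2, 1],
    [30, 28, 5],
    [25, 3, 20],
]

raw = derive_mapping(COUNTS, ROWS, COLS, normalize=False)
normalized = derive_mapping(COUNTS, ROWS, COLS, normalize=True)
print(raw)
print(normalized)
```

With raw counts every row selects the over-hypothesized first candidate; after dividing by the column totals, the second and third rows select the candidates that co-occur with them most consistently.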
We extended this technique to the state level, motivated by our intuition that some phones seemed hard to match from one language to another. To obtain the subphonetic mapping, we broke each HMM in the source and target language into its constituent states and derived an HMM from each of these states. Using these new, sub-phone HMMs we constructed a new confusion matrix. As expected, we found that some of these hard-to-match target language phones were modeled by assembling new models from phonetic subunits from other languages.
We described above how we established the best mapping for each phone/state of the target language. We found that when many states and phones from various languages were competing to represent any given target model, several models seemed to give high counts and thus be close candidates for a reasonable match. We explored the possibility of including several of these best matching candidates by combining the Gaussian models in their mixtures after weighting them accordingly. We established the weights used in this state combination in proportion to the normalized number of counts corresponding to the map.
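One way to picture the state combination above is to pool the Gaussians of the best candidates and rescale their mixture weights by each candidate's share of the counts. The sketch below assumes one-dimensional Gaussians stored as `(weight, mean, variance)` tuples and a `top_n` of three; the data layout, the names, and the numbers are hypothetical.

```python
def combine_states(candidates, top_n=3):
    """Merge the mixtures of the best-matching candidate states.

    `candidates` is a list of (count, mixture) pairs, where a mixture is a
    list of (weight, mean, variance) Gaussians whose weights sum to one.
    Each candidate's Gaussians are rescaled by its share of the total count
    among the selected candidates, so the merged weights again sum to one.
    """
    best = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_n]
    total = sum(count for count, _ in best)
    combined = []
    for count, mixture in best:
        share = count / total
        for weight, mean, var in mixture:
            combined.append((share * weight, mean, var))
    return combined


# Hypothetical candidate source states for one target state.
candidates = [
    (60, [(0.5, 0.0, 1.0), (0.5, 1.0, 2.0)]),
    (30, [(1.0, 0.4, 1.5)]),
    (10, [(0.25, -1.0, 1.0), (0.75, 2.0, 0.5)]),
    (5, [(1.0, 9.0, 1.0)]),  # fourth best, dropped when top_n is 3
]

merged = combine_states(candidates)
print(len(merged), sum(w for w, _, _ in merged))
```

Here the three best candidates contribute 60%, 30% and 10% of the probability mass of the merged state, and the fourth candidate is ignored.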
Table 4 shows recognition experiments we conducted using mappings derived from confusion matrices. For comparison in this experiment, monophone Czech models trained on 1 hour of Czech give 38% WER. When mappings are obtained using the phone-level confusion matrix approach, the word error rate drops below 70%. State-level mappings further reduce the error rate of the English mappings. Better results are obtained when multiple source languages are included (English, Spanish and Mandarin), and state mappings are obtained for both state-to-state mapping and best three states to a single Czech state (the 3-state method). The best result is below 55% WER. The 3-state methods reported differ in the presence (54.4%) or absence (55.8%) of count normalization of the columns in the confusion matrix.
| Source(s)/Method | WER | Source(s)/Method | WER |
| EN/Phone | 68.3 | SP/Phone | 68.7 |
| EN/State | 64.8 | SP/State | 70.0 |
| MA/State | 79.7 | EN+SP+MA/State | 62.3 |
| EN+SP+MA/3-State (no count normalization) | 55.8 | EN+SP+MA/3-State (count normalization) | 54.4 |
| Source | Mixtures / Type | Unadapted WER (%) | MLLR+MAP WER (%) |
| MA 10 hr. | 20 / monophone | 88.7 | 63.0 |
| SP 10 hr. | 20 / monophone | 71.6 | 50.9 |
| RU 3 hr. | 20 / monophone | 60.8 | 45.3 |
| EN 10 hr. | 20 / monophone | 75.7 | 47.2 |
| EN 10 hr. | 8 / triphone | 35.1 | |
| EN 72 hr. | 12 / triphone | 32.7 | |
| CZ 1 hr. | 20 / monophone | 33.4 | |
| CZ 1 hr. | 6 / triphone | 30.7 | |
Although the results are not reported in detail here, we find that DMC rescoring at the sentence level does not improve over the monolingual Czech performance. However, performance can be improved by applying DMC at the phoneme-class level. For example, the acoustic likelihood Acz(x|k) can be separated into the contributions of the vowel, consonant, and silence models. Parameters can then be introduced to define a posterior distribution based on these language-specific phonetic classes.
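As a rough sketch of phoneme-class combination, assume each N-best hypothesis carries a log score per knowledge source, named as in Table 6 (L for the language model; V, C and S for vowel, consonant and silence acoustic scores; the suffix for the language). The sketch combines them log-linearly with one weight per score and normalizes over the N-best list to obtain a posterior. The scores and weights below are invented for illustration and are set by hand; in DMC the weights are trained discriminatively, which this sketch does not attempt.

```python
import math


def combined_score(scores, weights):
    """Weighted sum of log scores; sources without a weight are ignored."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


def posteriors(nbest, weights):
    """Posterior over the N-best list implied by the log-linear combination."""
    totals = [combined_score(h["scores"], weights) for h in nbest]
    peak = max(totals)  # subtract the maximum for numerical stability
    exps = [math.exp(t - peak) for t in totals]
    norm = sum(exps)
    return [e / norm for e in exps]


def best_hypothesis(nbest, weights):
    """Return the word string of the highest-scoring hypothesis."""
    return max(nbest, key=lambda h: combined_score(h["scores"], weights))["words"]


# Hypothetical two-entry N-best list with per-class log scores.
nbest = [
    {"words": "hypothesis A",
     "scores": {"Lcz": -20.0, "Vcz": -40.0, "Ccz": -50.0, "Scz": -10.0,
                "Vru": -60.0, "Cru": -55.0, "Sru": -15.0}},
    {"words": "hypothesis B",
     "scores": {"Lcz": -22.0, "Vcz": -41.0, "Ccz": -50.0, "Scz": -10.0,
                "Vru": -45.0, "Cru": -50.0, "Sru": -15.0}},
]

czech_only = {"Lcz": 1.0, "Vcz": 1.0, "Ccz": 1.0, "Scz": 1.0}
multilingual = dict(czech_only, Vru=0.3, Cru=0.2, Sru=0.1)

print(best_hypothesis(nbest, czech_only))
print(best_hypothesis(nbest, multilingual))
print(posteriors(nbest, multilingual))
```

In this toy list the Czech-only weights prefer the first hypothesis, while adding modest weights on the Russian class scores shifts the decision to the second, which is the kind of reranking the phoneme-class parameters make possible.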
| Acoustic Scores and Phonetic Classes | WER(%) |
| N-Best oracle | 19.8 |
| first best (baseline) | 34.0 |
| Vru+Cru+Sru+Vsp+Csp+Ssp | 31.8 |
| Lcz+Acz+Aru+Asp+Aen | 29.2 |
| Lcz+Vcz+Ccz+Scz+Vru+Cru+Sru+Vsp+Csp+Ssp+Ven+Cen+Sen | 28.9 |
From the results in Table 6 we conclude that the structuring into phoneme classes improves performance over combination at the sentence level. Furthermore, combination of multilingual phoneme-class models performs better than the monolingual Czech systems, even when the monolingual systems are optimized using DMC.
ACKNOWLEDGMENTS This work was supported by the National Science Foundation under Grant No. IIS-9820687, and carried out at the 1999 Workshop on Language Engineering, Center for Language and Speech Processing, Johns Hopkins University. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or The Johns Hopkins University. Satellite news broadcast recordings were done under contract by the Linguistic Data Consortium, Philadelphia, PA, USA. We thank M. Riley and F. Pereira of AT&T for use of their large vocabulary decoder.
1 References and procedures are in the HTK documentation [6].