Adaptive Fusion of Acoustic and Visual Sources for Automatic Speech Recognition

Alexandrina Rogozan
Laboratoire d'Informatique de l'Université du Maine, rogozan@lium.univ-lemans.fr

Using additional knowledge together with the acoustic speech signal is a classical way of increasing the robustness and accuracy of automatic speech recognition systems. Numerous studies on speech perception emphasise the importance of visual information for speech recognition in humans. The use of visual data, mostly derived from the speaker's lip shape, therefore seems a promising approach to speech recognition in machines, especially in acoustically noisy environments.

Our research concerns the integration of visual and acoustic information for automatic speech recognition. Although this idea is very attractive, it raises numerous problems. One of the most debated questions is where exactly in the audio-visual speech recognition process the sensory integration should take place: at the data level or at the result level. The integration has to take into account the asynchrony between the acoustic and the visual information. Furthermore, the contribution of the acoustic and visual modalities should be adapted to their relative reliability. Finally, it remains to be established whether visual-specific decision units, the so-called visemes, are relevant for processing the visible speech signal.

With these questions in mind, we developed audio-visual systems based on continuous hidden Markov models, using data fusion for direct integration (DI), result fusion for separate integration (SI), and hybrid fusion of type DI+SI. Each modality enters the recognition process with its own weight, which is adapted dynamically during recognition according to both the signal-to-noise ratio, provided as a contextual input, and the phonetic content of the pronounced sentences.
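As an illustration only, the idea of weighting each modality by its reliability can be sketched for result fusion as a weighted sum of per-unit log-likelihoods. Everything specific in the sketch is an assumption rather than the system described here: the function names, the logistic mapping from signal-to-noise ratio to acoustic weight, its `midpoint_db` and `slope` parameters, and the toy scores are all hypothetical. The dependence of the weights on phonetic content is left out.

```python
import math


def acoustic_weight(snr_db, midpoint_db=10.0, slope=0.3):
    """Hypothetical mapping from SNR (dB) to an acoustic weight in (0, 1).

    A logistic curve is assumed here purely for illustration: the weight
    tends to 1 in clean conditions and to 0 in heavy noise.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse_scores(acoustic_loglik, visual_loglik, snr_db):
    """Combine per-unit log-likelihoods of the two modalities.

    The visual weight is taken as the complement of the acoustic weight.
    """
    lam = acoustic_weight(snr_db)
    return {
        unit: lam * acoustic_loglik[unit] + (1.0 - lam) * visual_loglik[unit]
        for unit in acoustic_loglik
    }


def recognise(acoustic_loglik, visual_loglik, snr_db):
    """Return the unit with the highest fused score."""
    fused = fuse_scores(acoustic_loglik, visual_loglik, snr_db)
    return max(fused, key=fused.get)


# Invented scores: the acoustic channel slightly favours "d",
# the visual channel clearly favours "b".
acoustic = {"b": -4.0, "d": -3.5}
visual = {"b": -1.0, "d": -6.0}

clean_choice = recognise(acoustic, visual, snr_db=30.0)   # acoustic dominates
noisy_choice = recognise(acoustic, visual, snr_db=-10.0)  # visual dominates
```

With these invented scores the decision follows the acoustic channel at 30 dB and the visual channel at -10 dB, which is the qualitative behaviour an SNR-driven weight is meant to produce.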
We tested these audio-visual systems on a speaker-dependent continuous-spelling task of French letters. Experiments performed under various noise levels show that, on the one hand, the system based on hybrid DI+SI integration performs better than either the DI-based or the SI-based system and that, on the other hand, adaptive modality weights improve performance. We also show that the most promising audio-visual system, the one based on DI+SI, may be improved by defining and using a viseme set adapted to the recognition task. This set is built from the speaker's visual data by means of self-organising Kohonen maps. To reinforce the role of visemes, we combined them with discriminative learning based on neural networks. Our research led to an audio-visual system using adaptive hybrid DI+SI integration, whose purely visual component is visemic and discriminative. This system is consistent with the cognitive hybrid models that have emerged from recent studies on audio-visual speech perception.
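To make the viseme-building step more concrete, the following is a minimal sketch of a one-dimensional self-organising map that groups lip-shape vectors into classes. It is not the configuration used in this work: the two-dimensional features (lip width and lip height, in arbitrary units), the invented sample values, the two-unit map, the linear decay schedules and the names `train_som` and `best_matching_unit` are all assumptions made for illustration.

```python
import math


def best_matching_unit(units, sample):
    """Index of the map unit whose weight vector is closest to the sample."""
    def sq_dist(unit):
        return sum((u - s) ** 2 for u, s in zip(unit, sample))
    return min(range(len(units)), key=lambda i: sq_dist(units[i]))


def train_som(samples, n_units=2, epochs=20, lr0=0.5, sigma0=0.5):
    """Train a 1-D Kohonen map (n_units >= 2) on a list of feature vectors."""
    dim = len(samples[0])
    lo = [min(s[d] for s in samples) for d in range(dim)]
    hi = [max(s[d] for s in samples) for d in range(dim)]
    # Deterministic initialisation along the diagonal of the data bounding box.
    units = [
        [lo[d] + (hi[d] - lo[d]) * i / (n_units - 1) for d in range(dim)]
        for i in range(n_units)
    ]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs      # linear decay, stays above zero
        lr = lr0 * decay                  # learning rate
        sigma = sigma0 * decay            # neighbourhood radius
        for sample in samples:
            bmu = best_matching_unit(units, sample)
            for i, unit in enumerate(units):
                # Gaussian neighbourhood: units close to the winner on the
                # map are pulled towards the sample more strongly.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * sigma ** 2))
                for d in range(dim):
                    unit[d] += lr * h * (sample[d] - unit[d])
    return units


# Invented (lip width, lip height) vectors: closed and open lip shapes,
# presented alternately.
samples = [
    (4.0, 0.10), (3.5, 2.00),
    (4.2, 0.20), (3.4, 1.90),
    (4.1, 0.15), (3.6, 2.10),
]
units = train_som(samples)
labels = [best_matching_unit(units, s) for s in samples]
```

Each map unit then plays the role of one candidate visual class: the closed-lip vectors share one unit and the open-lip vectors the other. A realistic viseme set would need richer lip features and a larger map.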