Simulating Full-Duplex Conversations for Evaluating AI Systems

January 27, 2026

Conversational artificial intelligence is advancing rapidly, and now includes systems that support so-called full-duplex spoken interaction: users can speak naturally, interrupt system responses, and signal engagement via backchannels, as they would with a human interlocutor. However, methodologies and benchmarks for evaluating such systems are not keeping pace. A major challenge is the need to simulate human users to support automatic assessment, especially during system development.

Current approaches to user simulation rely on cascaded system components, text-to-speech synthesis, and non-streaming APIs, which fail to emulate the dynamics of a human interlocutor and create a fundamental mismatch in latency and expressivity. Furthermore, current metrics are largely limited to simple heuristics for interruptions and backchannels, and they ignore critical paralinguistic aspects such as style and emotion. These limitations motivate two developments: evaluation criteria that cover conversational quality, competence, and paralinguistic awareness, and a full-duplex user simulator capable of generating realistic spoken interactions for systematic evaluation of spoken conversational AI.

A team led by Ondřej Klejch (Edinburgh University, UK), with members from several universities and industry, proposes to build a simulator capable of engaging in targeted, controllable conversations with the AI systems under evaluation. They aim for a simulator capable of generating scenarios in which key full-duplex behaviors naturally arise: the simulator should not only perform behaviors such as interrupting, backchanneling, expressing emotion, and conducting coherent dialogue, but should also exhibit them on demand when prompted. In doing so, they will enable systematic and repeatable evaluation of a system’s conversational, full-duplex, and paralinguistic abilities. The project will be organized into three sub-projects: a conversational user simulator, collaborative human-AI data annotation, and evaluation of spoken conversational models.
