VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li
2024-10-23
Abstract: Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered the development of LLM-based voice assistants. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speech, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics, environmental conditions, and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
Computation and Language, Artificial Intelligence, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the lack of a standardized evaluation benchmark for voice assistants based on large language models (LLMs). Existing evaluation methods mainly focus on automatic speech recognition (ASR) or on general-knowledge evaluation using clean, high-quality text-to-speech (TTS) synthesis, ignoring more complex real-world scenarios that involve diverse speaker characteristics, environmental factors, and content variations. This limits how thoroughly the performance of LLM-based voice assistants can be understood and improved. To fill this gap, the authors introduce a new benchmark, **VoiceBench**, which provides a multi-dimensional framework for evaluating the capabilities of LLM-based voice assistants. VoiceBench includes real and synthetic spoken instructions, covering three key real-world variations: speaker characteristics, environmental conditions, and content variations. Specifically, the goals of this paper are:

1. **Establish a comprehensive evaluation benchmark**: Provide a multi-dimensional evaluation framework for LLM-based voice assistants, covering general knowledge, instruction-following ability, and safety.
2. **Simulate real-world challenges**: Evaluate the performance of voice assistants in complex environments by introducing diverse speaker characteristics (such as age, accent, and pitch), environmental conditions (such as background noise, echo, and far-field conditions), and content variations (such as grammar mistakes, inaccurate pronunciation, and disfluent expressions).
3. **Reveal the limitations of existing models**: Through extensive experiments, reveal the limitations of current LLM-based voice assistant models and provide valuable insights for future research and development.

Through these efforts, VoiceBench aims to advance LLM-based voice assistant technology and improve its robustness and reliability across practical application scenarios.
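
To make the background-noise condition mentioned among the environmental variations more concrete, the following is a minimal sketch of how a clean spoken instruction could be mixed with noise at a chosen signal-to-noise ratio (SNR). It is an illustration under stated assumptions, not the procedure used by the VoiceBench authors: the function names `mix_at_snr` and `measured_snr_db`, the synthetic tone standing in for speech, and the SNR levels are all hypothetical.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR in dB.

    Hypothetical helper for illustration; not taken from VoiceBench.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise); solve for the noise gain.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """SNR of `noisy` relative to `clean`, treating their difference as noise."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((y - c) ** 2 for c, y in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    rng = random.Random(0)
    sample_rate = 16000  # assumed sampling rate in Hz
    # Stand-in for a spoken instruction: one second of a 220 Hz tone.
    speech = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate)
              for t in range(sample_rate)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]
    for snr in (20, 10, 0):  # assumed noise levels, from mild to severe
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 2))
```

In a robustness study of this kind, the same instruction would be presented to the assistant at each noise level, and the drop in response quality relative to the clean version would indicate how sensitive the model is to that condition.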