Abstract: Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered progress in the development of LLM-based voice assistants. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speech, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics as well as environmental and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.

What problem does this paper attempt to address?

This paper addresses the lack of a standardized evaluation benchmark for LLM-based voice assistants. Existing evaluation methods mainly focus on automatic speech recognition (ASR) or on general-knowledge evaluation using high-quality text-to-speech (TTS) synthesis, ignoring more complex real-world scenarios that involve diverse speaker characteristics, environmental factors, and content variations. This limits both the understanding and the improvement of LLM-based voice assistant performance. To fill this gap, the authors introduce a new benchmark, **VoiceBench**, which provides a multi-dimensional framework for evaluating the capabilities of LLM-based voice assistants. VoiceBench includes real and synthetic spoken instructions covering three key real-world variations: speaker characteristics, environmental conditions, and content variations.

Specifically, the goals of this paper are:

1. **Establish a comprehensive evaluation benchmark**: Provide a multi-dimensional evaluation framework for LLM-based voice assistants, covering general knowledge, instruction-following ability, and safety.
2. **Simulate real-world challenges**: Evaluate the performance of voice assistants in complex environments by introducing diverse speaker characteristics (such as age, accent, and pitch), environmental conditions (such as background noise, echo, and far-field conditions), and content variations (such as grammar mistakes, inaccurate pronunciation, and disfluent expressions).
3. **Reveal the limitations of existing models**: Through extensive experiments, reveal the limitations of current LLM-based voice assistant models and provide valuable insights for future research and development.

Through these efforts, VoiceBench aims to advance LLM-based voice assistant technology and improve its robustness and reliability across practical application scenarios.

VoiceBench: Benchmarking LLM-Based Voice Assistants

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

DialogBench: Evaluating LLMs as Human-like Dialogue Systems

AudioBench: A Universal Benchmark for Audio Large Language Models

FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback

SimulBench: Evaluating Language Models with Creative Simulation Tasks

User Interaction Patterns and Breakdowns in Conversing with LLM-Powered Voice Assistants

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

LLF-Bench: Benchmark for Interactive Learning from Language Feedback

On the Benchmarking of LLMs for Open-Domain Dialogue Evaluation

AgentBench: Evaluating LLMs as Agents

PhonologyBench: Evaluating Phonological Skills of Large Language Models

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

TaskBench: Benchmarking Large Language Models for Task Automation

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?