ASR Benchmarking: Need for a More Representative Conversational Dataset

Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad

2024-09-18

Abstract: Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.

Computation and Language, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses the poor performance of current Automatic Speech Recognition (ASR) systems in real conversational environments. ASR systems score well on existing benchmark datasets such as LibriSpeech and Fleurs, but those benchmarks do not adequately reflect the characteristics of actual conversations, such as pauses, interruptions, and diverse accents. The authors therefore propose a new multilingual conversational dataset derived from TalkBank, consisting of unstructured telephone conversations between adults, to evaluate ASR systems more realistically in conversational settings.

The main contributions of the paper are:

1. **Processing the TalkBank dataset**: Creating a multilingual dataset of unstructured conversations between adults that retains real conversational disfluencies.
2. **Benchmarking**: Evaluating various modern ASR systems on this dataset and comparing the results with those on existing benchmark datasets, which reveals a significant performance drop in conversational settings.
3. **Analyzing influencing factors**: Investigating how conversation-specific elements (such as laughter and pauses) affect Word Error Rate (WER), and finding correlations between these elements and WER.

Through this work, the paper highlights the shortcomings of current ASR models in handling real conversations and argues that more realistic conversational benchmark datasets are needed to improve ASR systems.
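To make the two quantities in this analysis concrete, the following is a minimal sketch of how WER and its correlation with a disfluency count could be computed. It is not the authors' pipeline: the function names `word_error_rate` and `pearson` are hypothetical, the text is assumed to be already normalized (lowercased, punctuation removed) and split on whitespace, and Pearson correlation is an assumed choice, since this summary does not say which correlation measure the paper uses.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Computed as the word-level Levenshtein distance divided by the number
    of reference words. Assumes both strings are already normalized.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words, giving 2/6. Because insertions are counted too, WER can exceed 1 when the hypothesis is much longer than the reference. Given per-utterance WER values and per-utterance counts of an element such as pauses, `pearson` returns a value in [-1, 1] describing how strongly the two vary together.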

MediaSpeech: Multilanguage ASR Benchmark and Dataset

Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions

Impact of ASR Performance on Free Speaking Language Assessment

Evaluation of state-of-the-art ASR Models in Child-Adult Interactions

DASB -- Discrete Audio and Speech Benchmark

ECAsT: a large dataset for conversational search and an evaluation of metric robustness

Benchmarking Representations for Speech, Music, and Acoustic Events

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

WER We Stand: Benchmarking Urdu ASR Models

ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding

ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction

A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain

VoiceBench: Benchmarking LLM-Based Voice Assistants

Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines

Towards measuring fairness in speech recognition: Fair-Speech dataset

Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach