ASR Benchmarking: Need for a More Representative Conversational Dataset

Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad
2024-09-18
Abstract: Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the gap between the strong results ASR systems achieve on standard benchmarks and their much weaker performance in real conversational environments. ASR systems score well on widely used benchmarks such as LibriSpeech and Fleurs, but these benchmarks do not adequately reflect the characteristics of actual conversations, such as pauses, interruptions, and diverse accents. The authors therefore introduce a multilingual conversational dataset derived from TalkBank, consisting of unstructured telephone conversations between adults, to evaluate ASR systems more realistically in conversational settings. The main contributions of the paper are:

1. **Processing the TalkBank dataset**: Creating a multilingual dataset of unstructured conversations between adults that retains real conversational disfluencies.
2. **Benchmarking**: Evaluating several modern ASR systems on this dataset and comparing the results with those on existing benchmarks, which shows a significant performance drop in conversational settings.
3. **Analyzing influencing factors**: Investigating how conversation-specific elements (such as laughter and pauses) affect Word Error Rate (WER), and finding that the presence of these elements correlates with WER.

Through this work, the paper highlights the shortcomings of current ASR models in handling real conversations and argues that more realistic conversational benchmarks are needed to guide improvements in ASR systems.
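To make the two quantities in this summary concrete, the following is a minimal sketch of how Word Error Rate and its correlation with a disfluency count could be computed. It is not the authors' evaluation code: the function names `wer` and `pearson` are hypothetical, the WER here uses plain whitespace tokenization with no text normalization, and Pearson correlation is an assumed choice, since the summary does not say which correlation measure the paper uses.

```python
from math import sqrt
from typing import Sequence


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length.

    Assumption: words are split on whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Toy example, not data from the paper: one substitution ("sat" -> "sit")
    # and one deletion ("the") against a six-word reference.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

In such a setup, each utterance would contribute one WER value and one count of annotated disfluencies (for example pauses or laughter markers), and `pearson` would be applied to the two resulting lists. The toy call above yields 2 errors over 6 reference words, i.e. a WER of about 0.33.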