Time and Tokens: Benchmarking End-to-End Speech Dysfluency Detection

Xuanru Zhou, Jiachen Lian, Cheol Jun Cho, Jingwen Liu, Zongli Ye, Jinming Zhang, Brittany Morin, David Baquirin, Jet Vonk, Zoe Ezzes, Zachary Miller, Maria Luisa Gorno Tempini, Gopala Anumanchipalli
2024-09-20
Abstract: Speech dysfluency modeling is the task of detecting dysfluencies in speech, such as repetition, block, insertion, replacement, and deletion. Most recent work treats it as a time-based object detection problem. In this work, we revisit the problem from a new perspective: tokenizing dysfluencies and modeling detection as a token-based automatic speech recognition (ASR) problem. We propose rule-based speech and text dysfluency simulators and develop VCTK-token, and then develop a Whisper-like seq2seq architecture to build a new benchmark with decent performance. We also systematically compare the proposed token-based methods with time-based methods, and propose a unified benchmark to facilitate future research. We open-source these resources for the broader scientific community. The project page is available at https://rorizzz.github.io/
Audio and Speech Processing, Artificial Intelligence, Sound