Abstract: Stuttering is a complex speech disorder that impairs fluent expression. People who stutter may exhibit various types of speech disfluencies, and speech-language pathologists typically diagnose stuttering based on the frequency and types of these disfluencies. Recent studies have achieved automatic detection of speech disfluencies with deep learning methods. However, the key problem with existing studies is that the training data contain a limited number of labeled speech samples from a small number of individuals who speak a particular language, which may lead to poor test performance. To address this problem, we proposed a speech disfluency detection framework with multi-adversarial tasks and self-training to improve performance in individual-independent and cross-language testing. First, a pre-trained wav2vec 2.0 model served as the feature extractor, and a classification head detected the different disfluency types. Second, we proposed speaker classification and language classification as auxiliary adversarial tasks to obfuscate speaker-related features and mitigate differences between languages. Third, we leveraged a self-training framework to learn the characteristics of speech disfluencies across datasets, which indicated the feasibility of using unlabeled data. To obtain higher-quality pseudo labels, we proposed a confidence-based pseudo-label generation framework with a performance-based threshold updating strategy. Finally, we proposed a time-related Grad-CAM to observe how features contribute to model decisions. The proposed methods were validated on German and English stuttered speech datasets. The unweighted average recall (UAR) increased by 1.55 and 4.5 percentage points (absolute) on the same-language and cross-language test sets, respectively. Visualization results showed that the regions the model attended to were similar to human judgment criteria. These promising results demonstrated the effectiveness of our methods and indicated potential applications of speech disfluency detection systems.
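To make the metric and the pseudo-labeling idea concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how UAR is computed as the mean of per-class recalls, and how a confidence threshold could filter predictions on unlabeled samples into pseudo labels. The abstract does not specify the threshold updating rule, so `update_threshold`, its `step`, `low`, and `high` parameters, and the choice to relax the threshold when validation UAR improves are all assumptions made purely for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally regardless of size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def select_pseudo_labels(probabilities, threshold):
    """Return (sample_index, label) pairs whose top class probability reaches the threshold."""
    selected = []
    for index, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected


def update_threshold(threshold, previous_uar, current_uar,
                     step=0.05, low=0.5, high=0.95):
    """Hypothetical rule (an assumption, not taken from the paper):
    relax the threshold when validation UAR improved, tighten it otherwise,
    and keep the result inside [low, high]."""
    if current_uar > previous_uar:
        threshold -= step
    else:
        threshold += step
    return min(high, max(low, threshold))


if __name__ == "__main__":
    # Toy class probabilities for three unlabeled samples (made-up values).
    probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
    print(select_pseudo_labels(probs, threshold=0.75))
    print(unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1]))
```

In a self-training loop under these assumptions, the selected pairs would be added to the training set, the model retrained, and the threshold adjusted from the change in validation UAR before the next round.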

Automatic Speech Disfluency Detection Using Wav2vec2.0 for Different Languages with Variable Lengths

Individual-independent and cross-language detection of speech disfluencies in stuttering based on multi-adversarial tasks and self-training

Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0

Automatic Disfluency Detection from Untranscribed Speech

An Interpretable and Generalizable Speech Detector Based on a CNN-LSTM Framework

Wavoice: A mmWave-Assisted Noise-Resistant Speech Recognition System

Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments

Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages

Automatic Fluency Assessment Method for Spontaneous Speech Without Reference Text

FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning

Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection

Augmenting Automatic Speech Recognition Models with Disfluency Detection

Enhancing Synthesized Speech Detection with Dual Attention Using Features Fusion

Exploring the Impact of Fine-Tuning the Wav2vec2 Model in Database-Independent Detection of Dysarthric Speech

Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech

Stuttering Disfluency Detection Using Machine Learning Approaches

Bi-LSTM-attention Based on ACNN Model for Disfluency Detection

Enhancing Neural Disfluency Detection with Hand-Crafted Features

DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model