Abstract: Stuttering is a complex speech disorder that impairs fluent speech. People who stutter may exhibit various types of speech disfluencies, and speech-language pathologists typically diagnose stuttering based on the frequency and types of these disfluencies. Recent studies have achieved automatic detection of speech disfluencies through deep learning methods. However, the key problem with existing studies is that the training data contain a limited number of labeled speech samples from a small number of individuals who speak a particular language, which may lead to poor test performance. To address this problem, we proposed a speech disfluency detection framework with multi-adversarial tasks and self-training to improve performance in individual-independent and cross-language testing. First, the pre-trained wav2vec2 served as the feature extractor, and a classification head detected the different disfluency types. Second, we proposed speaker classification and language classification as auxiliary adversarial tasks to obfuscate speaker-related features and mitigate differences between languages. Third, we leveraged a self-training framework to learn the characteristics of speech disfluencies across datasets, indicating the feasibility of using unlabeled data. To obtain higher-quality pseudo labels, we proposed a confidence-based pseudo-label generation framework with a performance-based threshold updating strategy. Finally, we proposed time-related Grad-CAM to observe how features contribute to model decisions. The proposed methods were validated on German and English stuttered speech datasets. The unweighted average recall (UAR) increased by an absolute 1.55% and 4.5% on the same-language and cross-language test sets, respectively. Visualization results showed that the features the model focused on were similar to human judgment criteria. These promising results demonstrated the effectiveness of our methods and indicated potential applications of speech disfluency detection systems.
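As a rough illustration of two ideas named in the abstract, the sketch below computes UAR (the mean of per-class recalls) and keeps only confident predictions as pseudo labels. It is a minimal sketch and not the authors' implementation: the function names, the single-label formulation, and the fixed threshold are assumptions made for illustration, and the performance-based threshold updating strategy is not modeled because the abstract does not specify it.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def select_pseudo_labels(probabilities, threshold):
    """Return (sample index, predicted class) pairs whose top probability
    reaches the threshold. The threshold is a hypothetical fixed value here."""
    selected = []
    for index, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected


# Toy data: class 0 has recall 2/3, class 1 has recall 1/1.
toy_uar = unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1])

# Only the first and third unlabeled samples are confident enough to keep.
toy_selected = select_pseudo_labels(
    [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]], threshold=0.75
)
```

Because UAR averages recall over classes rather than over samples, a model cannot score well by predicting only the majority class, which matters when disfluent segments are rare relative to fluent ones.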

Explainable Stuttering Recognition Using Axial Attention

Speech neuromuscular decoding based on spectrogram images using conformal predictors with Bi-LSTM

Stuttering Speech Disfluency Prediction using Explainable Attribution Vectors of Facial Muscle Movements

TranStutter: A Convolution-Free Transformer-Based Deep Learning Method to Classify Stuttered Speech Using 2D Mel-Spectrogram Visualization and Attention-Based Feature Representation

MMSD-Net: Towards Multi-modal Stuttering Detection

Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0

Psychophysiological Arousal in Young Children Who Stutter: An Interpretable AI Approach

Individual-independent and cross-language detection of speech disfluencies in stuttering based on multi-adversarial tasks and self-training

An End-To-End Stuttering Detection Method Based On Conformer And BILSTM

YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection

Highly Accurate End-to-end Image Steganalysis Based on Auxiliary Information and Attention Mechanism

A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech

Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings

A Novel Attention Model Across Heterogeneous Features for Stuttering Event Detection

Stuttering Disfluency Detection Using Machine Learning Approaches

Stutter Diagnosis and Therapy System Based on Deep Learning

A Stutter Seldom Comes Alone -- Cross-Corpus Stuttering Detection as a Multi-label Problem

Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass

Psychophysiology-aided Perceptually Fluent Speech Analysis of Children Who Stutter

Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection

FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models