Abstract: Stuttering is a complex speech disorder that impairs fluent expression. People who stutter may exhibit various types of speech disfluencies, and speech-language pathologists typically diagnose stuttering based on the frequency and types of these disfluencies. Recent studies have achieved automatic detection of speech disfluencies with deep learning methods. However, a key limitation of existing studies is that the training data contain few labeled speech samples, drawn from a small number of individuals who speak a single language, which may lead to poor test performance on unseen speakers and other languages. To address this problem, we proposed a speech disfluency detection framework with multi-adversarial tasks and self-training to improve performance in individual-independent and cross-language testing. Firstly, a pre-trained wav2vec2 model served as the feature extractor, and a classification head detected the different disfluency types. Secondly, we introduced speaker classification and language classification as auxiliary adversarial tasks to obfuscate speaker-related features and mitigate differences between languages. Thirdly, we leveraged a self-training framework to learn the characteristics of speech disfluencies across datasets, indicating the feasibility of using unlabeled data. To obtain higher-quality pseudo labels, we proposed a confidence-based pseudo-label generation framework with a performance-based threshold updating strategy. Finally, we proposed a time-related Grad-CAM to observe how features contribute to model decisions. The proposed methods were validated on German and English stuttered speech datasets. The unweighted average recall (UAR) increased by an absolute 1.55% and 4.5% on the same-language and cross-language test sets, respectively. Visualization results showed that the regions the model attended to were similar to human judgment criteria. These promising results demonstrated the effectiveness of our methods and indicated potential applications of speech disfluency detection systems.
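To make the self-training step more concrete, the following is a minimal sketch of confidence-based pseudo-label selection driven by UAR, not the authors' implementation. UAR is computed as the mean of per-class recalls. The function names, the single-label setting, and in particular the threshold update rule (relax the threshold when validation UAR improves, tighten it otherwise) are assumptions made for illustration; the abstract does not specify the actual rule.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def select_pseudo_labels(probabilities, threshold):
    """Keep unlabeled samples whose top class probability reaches the threshold.

    Returns a list of (sample_index, pseudo_label) pairs.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((i, label))
    return selected


def update_threshold(threshold, uar_before, uar_after,
                     step=0.05, low=0.5, high=0.99):
    """Hypothetical performance-based rule (an assumption, not from the paper).

    If validation UAR improved after the last round, admit more pseudo labels
    by lowering the threshold; otherwise demand higher confidence.
    """
    if uar_after > uar_before:
        threshold -= step
    else:
        threshold += step
    return min(high, max(low, threshold))


if __name__ == "__main__":
    # Toy class probabilities for three unlabeled clips (two classes).
    probs = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]]
    print(select_pseudo_labels(probs, threshold=0.9))
    print(unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1]))
    print(update_threshold(0.9, uar_before=0.60, uar_after=0.62))
```

In a full self-training loop, the selected pairs would be added to the training set, the model retrained, and the validation UAR before and after the round passed to `update_threshold`. Because UAR averages recall over classes rather than over samples, it is not dominated by the majority (fluent) class.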

Individual-independent and cross-language detection of speech disfluencies in stuttering based on multi-adversarial tasks and self-training
