Individual-independent and cross-language detection of speech disfluencies in stuttering based on multi-adversarial tasks and self-training

Jiakun Shen, Xueshuai Zhang
DOI: https://doi.org/10.1016/j.bspc.2024.107051
IF: 5.1
2024-10-20
Biomedical Signal Processing and Control
Abstract: Stuttering is a complex speech disorder that impairs fluent expression. People who stutter may exhibit various types of speech disfluencies, and speech-language pathologists typically diagnose stuttering based on the frequency and types of these disfluencies. Recent studies have achieved automatic detection of speech disfluencies with deep learning methods. However, the key problem with existing studies is that the training data contain only a limited number of labeled speech samples, drawn from a small number of individuals who speak one particular language, which may lead to poor test performance. To address this problem, we proposed a speech disfluency detection framework with multi-adversarial tasks and self-training to improve performance in individual-independent and cross-language testing. Firstly, the pre-trained wav2vec2 served as the feature extractor and a classification head detected the different disfluency types. Secondly, we proposed speaker classification and language classification as auxiliary adversarial tasks to obfuscate speaker-related features and mitigate differences between languages. Thirdly, we leveraged a self-training framework to learn the characteristics of speech disfluencies across datasets, indicating the feasibility of using unlabeled data. To obtain higher-quality pseudo labels, we proposed a confidence-based pseudo-label generation framework with a performance-based threshold updating strategy. Finally, we proposed time-related Grad-CAM to observe how features contribute to model decisions. The proposed methods were validated on German and English stuttered speech datasets. The unweighted average recall (UAR) increased by an absolute 1.55% and 4.5% on the same-language and cross-language test sets, respectively. Visualization results showed that the regions the model attended to were similar to human judgment criteria. These promising results demonstrated the effectiveness of our methods and indicated potential applications of speech disfluency detection systems.
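Two ideas in the abstract can be illustrated with a minimal Python sketch: the UAR metric (the mean of per-class recall, so each class counts equally regardless of its size) and confidence-based pseudo-label selection. This is not the authors' implementation. The function names, the toy inputs, and the fixed threshold are assumptions made for illustration, and the paper's performance-based threshold updating strategy is not shown because the abstract does not specify it.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = defaultdict(int)    # number of reference samples per class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def select_pseudo_labels(probs, threshold):
    """Return (index, label) pairs for unlabeled samples whose highest
    class probability is at least `threshold`.

    Assumption: `threshold` is fixed here; the paper updates it based on
    model performance, with a rule the abstract does not describe.
    """
    selected = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)  # index of the top class
        if p[label] >= threshold:
            selected.append((i, label))
    return selected


# Toy usage with made-up values (hypothetical, not from the paper).
uar = unweighted_average_recall([0, 0, 0, 1], [0, 0, 1, 1])
pseudo = select_pseudo_labels([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]], 0.75)
```

In the toy call, class 0 has recall 2/3 and class 1 has recall 1, so the UAR is their mean, 5/6, whereas plain accuracy would be 3/4. The second sample is dropped from the pseudo-labeled set because its top probability falls below the threshold.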
engineering, biomedical