Abstract: Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility resulting from slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for ASR training data augmentation. Differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components of dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a neural multi-talker TTS system is modified by adding a dysarthria severity level coefficient and a pause insertion model, so that it can synthesize dysarthric speech at varying severity levels. To evaluate the effectiveness of the synthesized speech as ASR training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of these parameters. Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems. In addition, we conducted a subjective evaluation of the dysarthric-ness and similarity of the synthesized speech. It shows that the perceived dysarthric-ness of synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthria.
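
The results above are reported as word error rate (WER). As a minimal sketch, assuming the standard definition (word-level edit distance divided by the number of reference words) and not the paper's actual scoring tool, WER can be computed as follows. The function name `word_error_rate` and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenization and no text normalization; real
    scoring tools usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a 4-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.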

Speech Synthesis as Augmentation for Low-Resource ASR

Speech Recognition with Augmented Synthesized Speech

ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation

Text Generation with Speech Synthesis for ASR Data Augmentation

Improving Low Resource Code-switched ASR using Augmented Code-switched TTS

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement

Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

Accurate synthesis of Dysarthric Speech for ASR data augmentation

Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models

Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion

ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Building African Voices

ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams

Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition