Abstract: Acoustic features extracted from speech can help with the diagnosis of neurological diseases and the monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step needed prior to extracting acoustic features. Machine learning techniques could be used to automate speech segmentation via automatic speech recognition (ASR) and sequence-to-sequence alignment. While state-of-the-art ASR models achieve good performance on healthy speech, their performance drops significantly when evaluated on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance for dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to improve ASR performance on dysarthric speech: 1) healthy individuals varying their speaking rate and loudness (as is often done in assessments of pathological speech); 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best-performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, a mean intersection-over-union of 89.2% was obtained against manual parsing (ground truth). It can be concluded that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, for speech segmentation.
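The abstract reports a word error rate and a mean intersection-over-union but does not say how either was computed. As a rough illustration only, the sketch below uses the conventional definitions: word error rate as word-level edit distance divided by the number of reference words, and intersection-over-union between predicted and manually labelled word intervals. The function names, the whitespace tokenization, and the one-to-one pairing of predicted and reference intervals are assumptions made for this example, not details taken from the study.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds); assumed representation


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def interval_iou(a: Interval, b: Interval) -> float:
    """Temporal intersection-over-union of two word intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(predicted: List[Interval], reference: List[Interval]) -> float:
    """Average IoU, assuming the i-th predicted word pairs with the i-th reference word."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("expected two non-empty lists of equal length")
    scores = [interval_iou(p, r) for p, r in zip(predicted, reference)]
    return sum(scores) / len(scores)
```

In practice the predicted and reference word sequences may differ in length, so a real evaluation would need an explicit rule for matching words before averaging; the equal-length pairing here is a simplification.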

Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation
