Abstract: Acoustic features extracted from speech can support the diagnosis of neurological diseases and the monitoring of symptoms over time. Temporal segmentation of audio signals into individual words is an important pre-processing step before acoustic features are extracted. Machine learning techniques can automate speech segmentation via automatic speech recognition (ASR) and sequence-to-sequence alignment. While state-of-the-art ASR models perform well on healthy speech, their performance drops significantly on dysarthric speech. Fine-tuning ASR models on impaired speech can improve performance for dysarthric individuals, but it requires representative clinical data, which is difficult to collect and may raise privacy concerns. This study explores the feasibility of using two augmentation methods to improve ASR performance on dysarthric speech: 1) emulated speech, in which healthy individuals vary their speaking rate and loudness (as is often done in assessments of pathological speech); and 2) synthetic speech with variations in speaking rate and accent (to ensure more diverse vocal representations and fairness). Experimental evaluations showed that fine-tuning a pre-trained ASR model with data from these two sources outperformed a model fine-tuned only on real clinical data and matched the performance of a model fine-tuned on the combination of real clinical data and synthetic speech. When evaluated on held-out acoustic data from 24 individuals with various neurological diseases, the best-performing model achieved an average word error rate of 5.7% and a mean correct count accuracy of 94.4%. In segmenting the data into individual words, it obtained a mean intersection-over-union of 89.2% against manual parsing (ground truth). These results indicate that emulated and synthetic augmentations can significantly reduce the need for real clinical data of dysarthric speech when fine-tuning ASR models and, in turn, when segmenting speech.
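
The abstract does not spell out how the word error rate and the intersection-over-union were computed, so the following is only a minimal sketch using the conventional definitions: word error rate as the word-level edit distance divided by the number of reference words, and intersection-over-union as the temporal overlap of a predicted word interval with its manually annotated interval, divided by their combined extent. The function names `word_error_rate` and `interval_iou` and the example values are illustrative assumptions, not the authors' implementation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def interval_iou(predicted: tuple[float, float],
                 ground_truth: tuple[float, float]) -> float:
    """Temporal intersection-over-union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(predicted[1], ground_truth[1])
                       - max(predicted[0], ground_truth[0]))
    union = ((predicted[1] - predicted[0])
             + (ground_truth[1] - ground_truth[0])
             - intersection)
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical transcript pair: one substitution in four reference words.
    print(word_error_rate("the quick brown fox", "the quick brown dog"))
    # Hypothetical word boundaries: predicted vs. manually parsed, in seconds.
    print(interval_iou((0.0, 1.0), (0.5, 1.5)))
```

A mean intersection-over-union such as the one reported would then be the average of `interval_iou` over all matched word intervals, under whatever matching rule the study used.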

A New Method for Predicting Severity Level of Dysarthric Speech Based on Joint Feature-Sample Selection Using Audio-Visual Data

Audio-video Database from Subacute Stroke Patients for Dysarthric Speech Intelligence Assessment and Preliminary Analysis.

Automatic Assessment of Dysarthria Using Audio-visual Vowel Graph Attention Network

Artificial Intelligence‐Powered Acoustic Analysis System for Dysarthria Severity Assessment

[Acoustic Analysis for 21 Patients with Amyotrophic Lateral Sclerosis Complaining of Dysarthria].

Automated Dysarthria Severity Classification: A Study on Acoustic Features and Deep Learning Techniques

Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech

Pre-trained models for detection and severity level classification of dysarthria from speech

Exploiting Audio-Visual Features with Pretrained AV-HuBERT for Multi-Modal Dysarthric Speech Reconstruction

A hybrid model for pathological voice recognition of post-stroke dysarthria by using 1DCNN and double-LSTM networks

An SVM-Based Mandarin Pronunciation Quality Assessment System.

Improving Dysarthric Speech Segmentation With Emulated and Synthetic Augmentation

A Novel Acoustic Evaluation Method for the Diagnosis of Adductor Spasmodic Dysphonia

Use of Speech Impairment Severity for Dysarthric Speech Recognition

An Mandarin Pronunciation Quality Assessment System Using Two Kinds of Acoustic Models

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Using sustained vowels to identify patients with mild Parkinson's disease in a Chinese dataset

Deep-Learning-Based Automated Classification of Chinese Speech Sound Disorders

Voice Biomarker Analysis and Automated Severity Classification of Dysarthric Speech in a Multilingual Context

Language-Independent Approach for Automatic Computation of Vowel Articulation Features in Dysarthric Speech Assessment

End-to-End Articulatory Modeling for Dysarthric Articulatory Attribute Detection.