Abstract:

Introduction: Dysarthria, a motor speech disorder caused by neurological damage, significantly reduces speech intelligibility and creates communication barriers for affected individuals. Voice conversion (VC) systems have been developed to address this, yet accurately predicting phonemes in dysarthric speech remains a challenge because of its variability. This study proposes an approach that integrates Fuzzy Expectation Maximization (FEM) with diffusion models for enhanced phoneme prediction, aiming to improve the quality of dysarthric voice conversion.

Methods: The proposed method combines FEM clustering with Diffusion Probabilistic Models (DPM). The diffusion model simulates noise addition and removal to make the speech representation more robust, while FEM iteratively optimizes phoneme boundaries, reducing uncertainty. The system was trained on the Saarland University Voice Disorder dataset, which consists of dysarthric and normal speech samples, with the conversion performed in the Mel-spectrogram domain. Evaluation uses both subjective (Mean Opinion Score, MOS) and objective (Word Error Rate, WER) metrics, complemented by ablation studies.

Results: Experimental results showed that the proposed method significantly improved phoneme prediction accuracy and overall voice conversion quality. It achieved higher MOS values for naturalness, intelligibility, and speaker similarity than existing models such as StarGAN-VC and CycleGAN-VC. The proposed method also achieved a lower WER for both mild and severe dysarthria cases, indicating more intelligible output speech.

Discussion: The integration of FEM with diffusion models offers substantial improvements in handling the irregularities of dysarthric speech. The ablation studies indicate that the method is robust: it maintains speech naturalness and intelligibility even without a speaker encoder. These findings suggest that the proposed approach can contribute to the development of more reliable assistive communication technologies for individuals with dysarthria, providing a promising foundation for future advancements in personalized speech therapy.
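The objective metric named in the abstract, WER, is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn a recognizer's output into the reference transcript, divided by the number of reference words. The abstract does not say how the authors computed it, so the following is only a minimal sketch of that standard definition. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Illustrative helper (hypothetical name); not the authors' evaluation code.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts: one reference word ("the") is missing from the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value means the converted speech was transcribed more accurately. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.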

Enhancing Dysarthric Voice Conversion with Fuzzy Expectation Maximization in Diffusion Models for Phoneme Prediction