Abstract

Introduction: Dysarthria, a motor speech disorder caused by neurological damage, significantly hampers speech intelligibility, creating communication barriers for affected individuals. Voice conversion (VC) systems have been developed to address this, yet accurately predicting phonemes in dysarthric speech remains a challenge because of its variability. This study proposes a novel approach that integrates Fuzzy Expectation Maximization (FEM) with diffusion models for enhanced phoneme prediction, aiming to improve the quality of dysarthric voice conversion.

Methods: The proposed method combines FEM clustering with Diffusion Probabilistic Models (DPM). Diffusion models simulate noise addition and removal to enhance the robustness of speech signals, while FEM iteratively optimizes phoneme boundaries, reducing uncertainty. The system was trained on the Saarland University Voice Disorder dataset, which consists of dysarthric and normal speech samples, with the conversion process represented in the Mel-spectrogram domain. Evaluation uses both subjective (Mean Opinion Score, MOS) and objective (Word Error Rate, WER) metrics, complemented by ablation studies.

Results: Experimental results showed that the proposed method significantly improved phoneme prediction accuracy and overall voice conversion quality. It achieved higher Mean Opinion Scores for naturalness, intelligibility, and speaker similarity than existing models such as StarGAN-VC and CycleGAN-VC. The proposed method also demonstrated a lower WER for both mild and severe dysarthria cases, indicating better performance in producing intelligible speech.

Discussion: The integration of FEM with diffusion models offers substantial improvements in handling the irregularities of dysarthric speech. The ablation studies indicate that the method is robust: it can maintain speech naturalness and intelligibility even without a speaker-encoder. These findings suggest that the proposed approach can contribute to the development of more reliable assistive communication technologies for individuals with dysarthria, providing a promising foundation for future advancements in personalized speech therapy.
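
The abstract reports WER as its objective metric but does not say how it was computed. As a point of reference only, the following minimal sketch implements the standard definition, WER = (substitutions + deletions + insertions) / number of reference words, using a word-level edit distance. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption, not the paper's setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

In a setup like the one described, the hypothesis would typically be an ASR transcript of the converted speech and the reference the intended text, so a lower value means more intelligible output. Note that WER can exceed 1.0 when the hypothesis contains many insertions.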

Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Vocal Timbre Effects with Differentiable Digital Signal Processing

SiD-WaveFlow: A Low-Resource Vocoder Independent of Prior Knowledge

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

DDSP-SFX: Acoustically-guided sound effects generation with differentiable digital signal processing

Accurate synthesis of Dysarthric Speech for ASR data augmentation

A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis

ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

Coding Speech through Vocal Tract Kinematics

Enhancing Dysarthric Voice Conversion with Fuzzy Expectation Maximization in Diffusion Models for Phoneme Prediction

DDSP: Differentiable Digital Signal Processing

Deep Speech Synthesis from MRI-Based Articulatory Representations

Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

VF-Taco2: Towards Fast and Lightweight Synthesis for Autoregressive Models with Variation Autoencoder and Feature Distillation

Deep Speech Synthesis from Multimodal Articulatory Representations

Speaker-independent neural formant synthesis

Multi-Modal Acoustic-Articulatory Feature Fusion For Dysarthric Speech Recognition

NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation