Abstract:

Introduction: Dysarthria, a motor speech disorder caused by neurological damage, significantly reduces speech intelligibility and creates communication barriers for affected individuals. Voice conversion (VC) systems have been developed to address this, yet accurately predicting phonemes in dysarthric speech remains a challenge because of its variability. This study proposes an approach that integrates Fuzzy Expectation Maximization (FEM) with diffusion models for phoneme prediction, with the aim of improving the quality of dysarthric voice conversion.

Methods: The proposed method combines FEM clustering with Diffusion Probabilistic Models (DPM). The diffusion model simulates noise addition and removal to make the speech representation more robust, while FEM iteratively optimizes phoneme boundaries, reducing uncertainty. The system was trained on the Saarland University Voice Disorder dataset, which consists of dysarthric and normal speech samples, with the conversion process represented in the Mel-spectrogram domain. Evaluation uses both subjective (Mean Opinion Score, MOS) and objective (Word Error Rate, WER) metrics, complemented by ablation studies.

Results: The proposed method significantly improved phoneme prediction accuracy and overall voice conversion quality. It achieved higher MOS for naturalness, intelligibility, and speaker similarity than existing models such as StarGAN-VC and CycleGAN-VC. It also achieved a lower WER for both mild and severe dysarthria cases, indicating that the converted speech is more intelligible.

Discussion: Integrating FEM with diffusion models offers substantial improvements in handling the irregularities of dysarthric speech. The ablation studies show that the method is robust: it maintains speech naturalness and intelligibility even without a speaker encoder. These findings suggest that the proposed approach can contribute to more reliable assistive communication technologies for individuals with dysarthria and provides a promising foundation for future work on personalized speech therapy.
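
The abstract names Word Error Rate as its objective metric but does not say how it was computed. As an illustration only, the sketch below uses the standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this sketch, not details taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Assumption for this sketch: words are obtained by whitespace splitting,
    with no text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition a lower value means the recognized transcript is closer to the reference, which is the sense in which the abstract reports a lower WER for converted speech. WER can exceed 1.0 when the hypothesis contains many inserted words.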