Abstract

Introduction: Dysarthria, a motor speech disorder caused by neurological damage, significantly reduces speech intelligibility and creates communication barriers for affected individuals. Voice conversion (VC) systems have been developed to address this, yet accurately predicting phonemes in dysarthric speech remains a challenge because of its variability. This study proposes a novel approach that integrates Fuzzy Expectation Maximization (FEM) with diffusion models for enhanced phoneme prediction, aiming to improve the quality of dysarthric voice conversion.

Methods: The proposed method combines FEM clustering with Diffusion Probabilistic Models (DPM). Diffusion models simulate noise addition and removal to enhance the robustness of speech signals, while FEM iteratively optimizes phoneme boundaries, reducing uncertainty. The system was trained using the Saarland University Voice Disorder dataset, consisting of dysarthric and normal speech samples, with the conversion process represented in the Mel-spectrogram domain. The framework is evaluated with both subjective (Mean Opinion Score, MOS) and objective (Word Error Rate, WER) metrics, complemented by ablation studies.

Results: Experimental results showed that the proposed method significantly improved phoneme prediction accuracy and overall voice conversion quality. It achieved higher MOS values for naturalness, intelligibility, and speaker similarity than existing models such as StarGAN-VC and CycleGAN-VC. Additionally, the proposed method demonstrated a lower WER for both mild and severe dysarthria cases, indicating better performance in producing intelligible speech.

Discussion: The integration of FEM with diffusion models offers substantial improvements in handling the irregularities of dysarthric speech. The ablation studies indicate that the method is robust: it can maintain speech naturalness and intelligibility even without a speaker-encoder. These findings suggest that the proposed approach can contribute to the development of more reliable assistive communication technologies for individuals with dysarthria, providing a promising foundation for future advancements in personalized speech therapy.
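
The abstract names Word Error Rate (WER) as its objective metric but does not say how it was computed. As a minimal sketch, WER is conventionally the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the recognizer and text normalization used in the study are not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Illustrative helper, not the authors' evaluation code. Words are split
    on whitespace with no further normalization (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, used only to show the call.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A lower value means the recognized transcript is closer to the reference, which is the sense in which the abstract reports a lower WER for converted speech. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.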

Dual-path transformer-based network with equalization-generation components prediction for flexible vibrational sensor speech enhancement in the time domain

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

ForkNet: Simultaneous Time and Time-Frequency Domain Modeling for Speech Enhancement

Speech Enhancement with Perceptually-motivated Optimization and Dual Transformations

Improving the Intelligibility of Electric and Acoustic Stimulation Speech Using Fully Convolutional Networks Based Speech Enhancement

SETransformer: Speech Enhancement Transformer

Improving the Intelligibility of Speech for Simulated Electric and Acoustic Stimulation Using Fully Convolutional Neural Networks

THLNet: two-stage heterogeneous lightweight network for monaural speech enhancement

Enhancing Dysarthric Voice Conversion with Fuzzy Expectation Maximization in Diffusion Models for Phoneme Prediction

Deep Learning-Based Speech Enhancement of an Extrinsic Fabry–Perot Interferometric Fiber Acoustic Sensor System

Uformer: A Unet Based Dilated Complex & Real Dual-Path Conformer Network for Simultaneous Speech Enhancement and Dereverberation

FoVNet: Configurable Field-of-View Speech Enhancement with Low Computation and Distortion for Smart Glasses

Multi-Stage Progressive Speech Enhancement Network

Speech enhancement from fused features based on deep neural network and gated recurrent unit network

Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement

CompNet: Complementary Network for Single-Channel Speech Enhancement

Diff-ETS: Learning a Diffusion Probabilistic Model for Electromyography-to-Speech Conversion

Improving Deep Neural Network Based Speech Enhancement in Low SNR Environments

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Efficient Monaural Speech Enhancement using Spectrum Attention Fusion

Densely Connected Multi-Stage Model with Channel Wise Subband Feature for Real-Time Speech Enhancement