Abstract:

Introduction: Dysarthria, a motor speech disorder caused by neurological damage, significantly reduces speech intelligibility and creates communication barriers for affected individuals. Voice conversion (VC) systems have been developed to address this, yet accurately predicting phonemes in dysarthric speech remains difficult because of its variability. This study proposes an approach that integrates Fuzzy Expectation Maximization (FEM) with diffusion models for improved phoneme prediction, with the aim of improving the quality of dysarthric voice conversion.

Methods: The proposed method combines FEM clustering with Diffusion Probabilistic Models (DPM). The diffusion model simulates noise addition and removal to make the speech representation more robust, while FEM iteratively optimizes phoneme boundaries, reducing uncertainty. The system was trained on the Saarland University Voice Disorder dataset, which consists of dysarthric and normal speech samples, with the conversion process represented in the Mel-spectrogram domain. The framework is evaluated with both subjective (Mean Opinion Score, MOS) and objective (Word Error Rate, WER) metrics, complemented by ablation studies.

Results: Experimental results showed that the proposed method significantly improved phoneme prediction accuracy and overall voice conversion quality. It achieved higher MOS values for naturalness, intelligibility, and speaker similarity than existing models such as StarGAN-VC and CycleGAN-VC. The proposed method also showed a lower WER for both mild and severe dysarthria cases, indicating that it produces more intelligible speech.

Discussion: Integrating FEM with diffusion models offers substantial improvements in handling the irregularities of dysarthric speech. The ablation studies indicate that the method is robust: it maintains speech naturalness and intelligibility even without a speaker encoder. These findings suggest that the proposed approach can contribute to more reliable assistive communication technologies for individuals with dysarthria and provides a promising foundation for future work on personalized speech therapy.
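The objective metric named in the Methods, Word Error Rate, is conventionally defined as the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The abstract does not say how the authors computed it (which recognizer, which text normalization), so the following is only a minimal sketch of the standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly;
    a real evaluation would usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One reference word ("the") is missing from the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis contains many extra words; lower values indicate more intelligible output.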

Phonology-Augmented Statistical Framework for Machine Transliteration Using Limited Linguistic Resources

A High Accuracy Approach for Word-Phoneme Translation Using Neural Networks

Enhancing Cross-lingual Transfer via Phonemic Transcription Integration

Multimodal neural pronunciation modeling for spoken languages with logographic origin

Diversity by Phonetics and its Application in Neural Machine Translation

Statistically-based Model for Computer-Aided Transcription Application

Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese

Reference-Based Post-OCR Processing with LLM for Diacritic Languages

Reducing pronunciation lexicon confusion and using more data without phonetic transcription for pronunciation modeling

AlloST: Low-resource Speech Translation without Source Transcription

Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings

Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models

A Combination of BERT and Transformer for Vietnamese Spelling Correction

Learning Cross-Lingual Information with Multilingual BLSTM for Speech Synthesis of Low-Resource Languages

Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition

Letter-to-Sound Pronunciation Prediction Using Conditional Random Fields

Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer

Pronunciation Generation for Foreign Language Words in Intra-Sentential Code-Switching Speech Recognition

PhoWhisper: Automatic Speech Recognition for Vietnamese

Enhancing Dysarthric Voice Conversion with Fuzzy Expectation Maximization in Diffusion Models for Phoneme Prediction

Transliteration Pair Extraction from Classical Chinese Buddhist Literature Using Phonetic Similarity Measurement