Abstract:

BACKGROUND AND OBJECTIVE: Most dysarthric patients encounter communication problems due to unintelligible speech. Many voice-driven systems currently aim to improve their speech intelligibility; however, the intelligibility performance of these systems is affected by challenging application conditions (e.g., time variance of the patient's speech and background noise). To alleviate these problems, we proposed a dysarthria voice conversion (DVC) system for dysarthric patients and investigated its benefits under challenging application conditions.

METHOD: A deep learning-based voice conversion system with phonetic posteriorgram (PPG) features, called the DVC-PPG system, was proposed in this study. An objective evaluation metric based on the Google automatic speech recognition (Google ASR) system and a listening test were used to demonstrate the speech intelligibility benefits of DVC-PPG under quiet and noisy test conditions. In addition, a well-known voice conversion system using mel-spectrograms, DVC-Mels, was used for comparison to verify the benefits of the proposed DVC-PPG system.

RESULTS: Under quiet conditions, the Google ASR objective metric, averaged over two subjects, showed that the DVC-PPG system provided higher speech recognition rates in the duplicate and outside test conditions (83.2% and 67.5%) than the original dysarthric speech (36.5% and 26.9%) and DVC-Mels (52.9% and 33.8%). Moreover, the DVC-PPG system provided more stable performance than DVC-Mels under noisy test conditions. In addition, the results of the listening test showed that the speech intelligibility of DVC-PPG was better than that of the dysarthric speech and DVC-Mels under both the duplicate and outside conditions.

CONCLUSIONS: The objective evaluation metric and listening test results showed that the recognition rate of the proposed DVC-PPG system was significantly higher than those obtained with the original dysarthric speech and the DVC-Mels system. It can therefore be inferred from our study that the DVC-PPG system can improve the ability of dysarthric patients to communicate with people under challenging application conditions.
