Abstract: Dysarthria, a speech disorder often caused by neurological damage, impairs patients' control of the vocal muscles, making their speech unclear and communication difficult. Recently, voice-driven methods have been proposed to improve the speech intelligibility of patients with dysarthria. However, most of these methods require a large corpus from both the patient and the target speaker, which imposes a heavy recording burden. This study proposes a data augmentation-based voice conversion (VC) system to reduce that burden. The proposed system, dysarthria voice conversion 3.1 (DVC 3.1), uses a data augmentation approach that combines text-to-speech and the StarGAN-VC architecture to synthesize a large target corpus and a large patient-like corpus. An objective evaluation using the Google automatic speech recognition (Google ASR) system and a subjective listening test were used to demonstrate the speech intelligibility benefits of DVC 3.1 under free-talk conditions. The DVC system without data augmentation (DVC 3.0) was used for comparison. In the objective evaluation, DVC 3.1 improved Google ASR results for two dysarthria patients by approximately [62.4%, 43.3%] compared to unprocessed dysarthria speech and by approximately [55.9%, 57.3%] compared to the DVC 3.0 system. In the listening test, DVC 3.1 increased the speech intelligibility of the two patients by approximately [54.2%, 22.3%] compared to unprocessed dysarthria speech and by approximately [63.4%, 70.1%] compared to the DVC 3.0 system. The proposed DVC 3.1 system therefore shows significant potential to improve the speech intelligibility of patients with dysarthria and to enhance the quality of their verbal communication.

Voice Conversion towards Arbitrary Speakers With Limited Data

Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification

One-Shot Voice Conversion with Global Speaker Embeddings

Residual Speaker Representation for One-Shot Voice Conversion

How far are we from robust voice conversion: a survey

Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

Multi-target Voice Conversion Without Parallel Data by Adversarially Learning Disentangled Audio Representations

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Improving the Efficiency of Dysarthria Voice Conversion System Based on Data Augmentation

Innovative Speaker-Adaptive Style Transfer VAE-WadaIN for Enhanced Voice Conversion in Intelligent Speech Processing

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts

Who is Authentic Speaker

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

ALO-VC: Any-to-any Low-latency One-shot Voice Conversion

Data Augmentation for Diverse Voice Conversion in Noisy Environments

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

MediumVC: Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

Voice Conversion Based on Speaker Independent Model

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion