Abstract: Non-parallel voice conversion (VC) has achieved considerable breakthroughs in recent years thanks to the use of self-supervised pre-trained representations (SSPR). Features extracted by the pre-trained model are expected to contain more content information. However, common SSPR-based VC has no dedicated mechanism for removing speaker information when the content representation is extracted, which prevents the SSPR representation from being further purified of speaker information. Moreover, conventional VC often selects the Mel-spectrogram as the reconstructed acoustic feature, which is inconsistent with the input of the content encoder and results in some information loss. Motivated by the above, we propose W2VC to address these issues. W2VC consists of three parts: (1) we reconstruct features from the WavLM representation (WLMR), which is more consistent with the input of the content encoder; (2) connectionist temporal classification (CTC) is used to align the content representation with the text context at the phoneme level, and a content encoder with a gradient reversal layer (GRL) based speaker classifier is used to remove speaker information during content representation extraction; (3) a WLMR-based HiFi-GAN is trained to convert WLMR to waveform speech. VC experimental results show that GRL can effectively purify the content information of the self-supervised model, and that GRL purification and CTC supervision on the content encoder are complementary in improving VC performance. Moreover, speech synthesized with the vocoder retrained on WLMR achieves better results in both subjective and objective evaluation. The proposed method is evaluated on the VCTK and CMU databases. It achieves 8.901 in objective MCD, and subjective MOS scores of 4.45 for speech naturalness and 3.62 for speaker similarity, which is superior to the baseline.
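To illustrate the gradient reversal idea mentioned in part (2), the following is a minimal toy sketch in plain Python, not the authors' implementation. It assumes a scalar "content encoder" weight `w_enc`, a scalar binary "speaker classifier" weight `w_cls`, and a reversal strength `lam`; all of these names and the logistic loss are hypothetical choices made for illustration only. The layer is the identity in the forward pass and multiplies the gradient by `-lam` in the backward pass, so the classifier learns to predict the speaker while the encoder is pushed to make that prediction harder.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grl_forward(h):
    """Gradient reversal layer, forward pass: plain identity."""
    return h


def grl_backward(grad_h, lam):
    """Gradient reversal layer, backward pass: flip and scale the gradient."""
    return -lam * grad_h


def speaker_loss_grads(x, speaker, w_enc, w_cls, lam=1.0):
    """Toy scalar model (assumed for illustration): encoder -> GRL -> classifier.

    Returns the speaker cross-entropy loss and its gradients with respect to
    the classifier weight and the encoder weight.
    """
    # Forward pass.
    h = w_enc * x                      # content representation
    z = w_cls * grl_forward(h)         # speaker logit
    p = sigmoid(z)
    loss = -(speaker * math.log(p) + (1 - speaker) * math.log(1 - p))

    # Backward pass, written out by hand.
    dz = p - speaker                   # d loss / d z for sigmoid + cross-entropy
    grad_w_cls = dz * h                # classifier receives the ordinary gradient
    grad_h = dz * w_cls                # gradient arriving at the GRL output
    grad_w_enc = grl_backward(grad_h, lam) * x  # encoder receives the reversed one
    return loss, grad_w_cls, grad_w_enc


if __name__ == "__main__":
    loss, g_cls, g_enc = speaker_loss_grads(x=2.0, speaker=1, w_enc=0.5, w_cls=1.0)
    print(f"loss={loss:.4f} grad_w_cls={g_cls:.4f} grad_w_enc={g_enc:.4f}")
```

With gradient descent on both weights, the classifier update lowers the speaker loss while the encoder update raises it, which is the adversarial effect a GRL is meant to produce. In a deep-learning framework the same behaviour would normally be written as a custom autograd function rather than by hand.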

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Improving the Efficiency of Dysarthria Voice Conversion System Based on Data Augmentation

Improving Recognition-Synthesis Based Any-to-one Voice Conversion with Cyclic Training

Iteratively Improving Speech Recognition and Voice Conversion

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Towards General-Purpose Text-Instruction-Guided Voice Conversion

How far are we from robust voice conversion: a survey

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

ARVC: an Auto-Regressive Voice Conversion System Without Parallel Training Data

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

CASIA Voice Conversion System for the Voice Conversion Challenge 2020

W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision