Abstract: Voice conversion (VC) converts a source speaker's speech into a target speaker's voice without changing the speech content. Current VC methods have three problems: (1) they apply only to a limited set of speakers rather than to arbitrary speakers, which greatly restricts their application scenarios; (2) the representation (feature) separation (RS) achieved by current mainstream techniques is not ideal for either the source speaker's speech or the target speaker's speech; and (3) the conversion quality of most models is unsatisfactory and needs to be improved. In this paper, we therefore construct a one-shot VC model based on representation separation, called the RS-VC model, implemented with an encoder-decoder structure. The encoder consists of a content encoder and a speaker encoder. The content encoder separates the content information from the source speaker's voice and generates a content representation. The speaker encoder separates the speaker information from the target speaker's voice and generates a speaker representation. The decoder combines the content representation and the speaker representation to generate the converted voice. We obtain an optimized speaker verification (SV) model, SVINGE2E (Speaker Verification with Instance Normalization using Generalized End-to-End loss), by improving a basic SV model, and use it as the speaker encoder. This speaker encoder is trained before RS-VC model training; the pre-trained SVINGE2E model directly extracts the speaker representation from the target speaker's voice and is used for both training and testing the RS-VC model. We then propose a progressive training method for the RS-VC model. Experiments show that the progressive training method effectively improves the quality of the converted voice.
Compared with the basic speaker verification model, both SVINGE2E and RS-VC improve the EER (equal error rate).
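As a point of reference for the EER metric mentioned above, the following is a minimal sketch of how an equal error rate can be estimated from verification scores. It is not the evaluation code used in this paper. The function name `equal_error_rate` and the assumption that scores are similarities (higher means more likely the same speaker) are illustrative choices, and the threshold sweep is a simple approximation that picks the observed score where the false acceptance and false rejection rates are closest.

```python
from typing import Sequence, Tuple


def equal_error_rate(genuine: Sequence[float],
                     impostor: Sequence[float]) -> Tuple[float, float]:
    """Return (eer, threshold) for similarity scores (hypothetical helper).

    genuine:  scores of trials where both utterances share a speaker.
    impostor: scores of trials where the speakers differ.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")

    best_gap, best_eer, best_thr = float("inf"), 1.0, 0.0
    # Every observed score is a candidate decision threshold.
    for thr in sorted(set(genuine) | set(impostor)):
        # False acceptance: an impostor trial scores at or above the threshold.
        far = sum(s >= thr for s in impostor) / len(impostor)
        # False rejection: a genuine trial scores below the threshold.
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


if __name__ == "__main__":
    # Toy scores, made up for illustration only.
    eer, thr = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
    print(f"EER = {eer:.2f} at threshold {thr}")
```

A lower EER indicates that the speaker representations separate same-speaker and different-speaker trials more cleanly, which is the sense in which the abstract reports an improvement.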

W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision

vec2wav 2.0: Advancing Voice Conversion via Discrete Token Vocoders

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

CLESSR-VC: Contrastive learning enhanced self-supervised representations for one-shot voice conversion

WaveVC: Speech and Fundamental Frequency Consistent Raw Audio Voice Conversion

S2VC - A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations.

Multi-level Temporal-channel Speaker Retrieval for Zero-shot Voice Conversion

Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers.

Wavoice: an Mmwave-Assisted Noise-Resistant Speech Recognition System

Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion.

WaveNet Vocoder with Limited Training Data for Voice Conversion

CTEFM-VC: Zero-Shot Voice Conversion Based on Content-Aware Timbre Ensemble Modeling and Flow Matching

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion

Voice Conversion Based on Hybrid SVR and GMM

High-Quality Voice Conversion Using Spectrogram-Based Wavenet Vocoder

One-Shot Voice Conversion Algorithm Based on Representations Separation

Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks