Abstract: Voice conversion (VC) converts a source speaker's speech into a target speaker's voice without changing the speech content. Current VC methods have three problems: (1) they apply only to a limited set of speakers rather than to arbitrary speakers, which greatly restricts their application scenarios; (2) the representation (feature) separation (RS) that current mainstream techniques achieve between source-speaker speech and target-speaker speech is not ideal; and (3) the conversion quality of most models is unsatisfactory and needs improvement. In this paper, we therefore construct a one-shot VC model based on representation separation, called RS-VC, implemented with an encoder-decoder structure. The encoder consists of a content encoder and a speaker encoder. The content encoder separates the content information from the source speaker's voice and generates a content representation. The speaker encoder separates the speaker information from the target speaker's voice and generates a speaker representation. The decoder combines the content representation and the speaker representation to generate the converted voice. We obtain an optimized speaker verification model, SVINGE2E (Speaker Verification with Instance Normalization using Generalized End-to-End loss), by improving the speaker verification (SV) model, and use it as the speaker encoder. This speaker encoder is trained before RS-VC training; the pre-trained SVINGE2E model directly extracts the speaker representation from the target speaker's voice and is used for both training and testing RS-VC. We then propose a progressive training method for RS-VC. Experiments show that progressive training effectively improves the quality of the converted voice. Compared with the basic speaker verification model, both SVINGE2E and RS-VC deliver notable improvements in equal error rate (EER).
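For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate can be approximated from speaker verification scores. It is not the paper's evaluation code: the function name `equal_error_rate`, the threshold sweep, and the example scores are illustrative assumptions, and the abstract does not describe the actual scoring protocol.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from similarity scores.

    genuine:  scores for same-speaker trials (higher = more similar)
    impostor: scores for different-speaker trials
    Hypothetical helper; a trial is accepted when score >= threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine trials wrongly rejected.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER lies where the two error rates meet; on a finite
            # score set we take the closest point and average the two.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))  # 0.25
```

A lower EER means the speaker representations separate same-speaker and different-speaker trials more cleanly, which is why it serves as a proxy for the quality of a speaker encoder.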

Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations

Disentangled Speech Representation Learning for One-Shot Cross-Lingual Voice Conversion Using β-VAE

MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion

Zero-shot voice conversion based on feature disentanglement

One-Shot Voice Conversion Algorithm Based on Representations Separation

Residual Speaker Representation for One-Shot Voice Conversion

One-shot voice conversion using a combination of U2-Net and vector quantization

UNET-TTS: Improving Unseen Speaker and Style Transfer in One-Shot Voice Cloning

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

One-Shot Voice Conversion by Vector Quantization

Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion

Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

One-shot Emotional Voice Conversion Based on Feature Separation

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture

One-Shot Voice Conversion with Global Speaker Embeddings

StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models