Abstract: This article presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentangled linguistic representations with two strategies. First, phoneme transcriptions of the training data are introduced to provide references for learning linguistic representations of audio signals. Second, an adversarial training strategy is employed to further remove speaker information from the linguistic representations. Meanwhile, speaker representations are extracted from audio signals by a speaker encoder. The model parameters are estimated by two-stage training, consisting of a pre-training stage using a multi-speaker dataset and a fine-tuning stage using the dataset of a specific conversion pair. Since both the recognition encoder and the decoder for recovering acoustic features are seq2seq neural networks, there are no constraints of frame alignment or frame-by-frame conversion in our proposed method. Experimental results showed that our method obtained higher similarity and naturalness than the best non-parallel voice conversion method in Voice Conversion Challenge 2018. In addition, the performance of our proposed method was close to that of the state-of-the-art parallel seq2seq voice conversion method.
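The conversion step described in the abstract (keep the linguistic representation of the source utterance, take the speaker representation from the target) can be illustrated with a minimal data-flow sketch. This is not the model from the article: the class `VoiceConverter` and the three `toy_*` functions are hypothetical stand-ins that use simple arithmetic in place of the seq2seq neural networks, and the toy decoder returns as many frames as it receives, whereas a seq2seq decoder is not bound to the input length.

```python
from dataclasses import dataclass
from typing import Callable, List

Frames = List[List[float]]  # acoustic features: one vector per frame


@dataclass
class VoiceConverter:
    """Hypothetical wiring of the three components named in the abstract."""
    recognition_encoder: Callable[[Frames], Frames]
    speaker_encoder: Callable[[Frames], List[float]]
    decoder: Callable[[Frames, List[float]], Frames]

    def convert(self, source: Frames, target_reference: Frames) -> Frames:
        # Linguistic content comes from the source utterance.
        linguistic = self.recognition_encoder(source)
        # Speaker identity comes from an utterance of the target speaker.
        speaker = self.speaker_encoder(target_reference)
        # The decoder recovers acoustic features from both representations.
        return self.decoder(linguistic, speaker)


def toy_speaker_encoder(frames: Frames) -> List[float]:
    """Toy assumption: the speaker is the per-dimension mean over frames."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def toy_recognition_encoder(frames: Frames) -> Frames:
    """Toy assumption: content is what remains after removing the mean."""
    mean = toy_speaker_encoder(frames)
    return [[x - m for x, m in zip(f, mean)] for f in frames]


def toy_decoder(linguistic: Frames, speaker: List[float]) -> Frames:
    """Toy assumption: add the speaker vector back onto every frame."""
    return [[x + s for x, s in zip(f, speaker)] for f in linguistic]


converter = VoiceConverter(
    recognition_encoder=toy_recognition_encoder,
    speaker_encoder=toy_speaker_encoder,
    decoder=toy_decoder,
)

source = [[1.0, 2.0], [3.0, 4.0]]        # utterance by the source speaker
target = [[10.0, 20.0], [10.0, 20.0]]    # reference by the target speaker
converted = converter.convert(source, target)
print(converted)
```

In this toy setting the frame-to-frame variation of the source is preserved while the per-utterance mean is replaced by that of the target reference, which mirrors the preserve-and-replace idea only at the level of data flow.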

Noise-robust voice conversion using adversarial training with multi-feature decoupling

Noise-robust voice conversion with domain adversarial training

A noise-robust voice conversion method with controllable background sounds

Multi-target Voice Conversion Without Parallel Data by Adversarially Learning Disentangled Audio Representations

Toward Degradation-Robust Voice Conversion

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

Residual Speaker Representation for One-Shot Voice Conversion

End-to-End Voice Conversion with Information Perturbation

Defending Your Voice: Adversarial Attack on Voice Conversion

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

HiFiDenoise: High-Fidelity Denoising Text to Speech with Adversarial Networks

Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network

A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

DualVC: Dual-mode Voice Conversion using Intra-model Knowledge Distillation and Hybrid Predictive Coding

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion