Abstract: Most existing voice conversion methods focus primarily on separating speech content from speaker information while overlooking the decoupling of pitch information. Additionally, the quality of converted speech degrades significantly when the speech of the target speaker is contaminated by noise. To address these issues, this paper proposes a noise-robust voice conversion model with multi-feature decoupling based on adversarial training. The proposed framework uses three distinct encoders to encode speech content, speaker identity, and pitch information independently, and aims to improve decoupling by minimizing the mutual information between these features and reducing the correlations between their feature vectors. Moreover, a gradient reversal layer and a noise decoupling discriminator are incorporated into the framework; through adversarial training, they extract noise-resistant speaker and content representations that facilitate the synthesis of high-quality speech. To optimize the learning process, a training strategy is developed that alternates between clean and noisy data during the training of the encoder. This strategy effectively guides and expedites the convergence of the model. Experimental results demonstrate that, compared to state-of-the-art noise-robust voice conversion baselines, the proposed model achieves improvements of around 0.31 and 0.39 in speech naturalness and speaker similarity evaluation metrics, respectively.
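
As a rough illustration of two mechanisms named in the abstract, the sketch below shows what a gradient reversal layer does (identity in the forward pass, negated and scaled gradient in the backward pass) and one possible way to alternate clean and noisy batches. It is not the paper's implementation: the class and function names, the `lam` scaling factor, and the strict per-step alternation starting with clean data are all assumptions made for illustration.

```python
# Hypothetical sketch: all names, the lam value, and the alternation
# pattern are illustrative assumptions, not taken from the paper.


class GradientReversal:
    """Identity in the forward pass; negates and scales gradients going back."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (assumed hyperparameter)

    def forward(self, features):
        # Encoder features reach the noise discriminator unchanged.
        return list(features)

    def backward(self, grad_from_discriminator):
        # The encoder receives the negated gradient, so an update that would
        # help the discriminator detect noise instead pushes the encoder
        # toward representations that hide it.
        return [-self.lam * g for g in grad_from_discriminator]


def batch_schedule(num_steps):
    """Alternate clean and noisy batches, starting with clean (assumed order)."""
    return ["clean" if step % 2 == 0 else "noisy" for step in range(num_steps)]


grl = GradientReversal(lam=0.5)
features = grl.forward([0.2, -1.0, 3.0])
encoder_grad = grl.backward([1.0, -2.0, 4.0])
schedule = batch_schedule(4)
```

In an autodiff framework the same idea would normally be written as a custom operation with its own backward rule; the plain-Python version above only makes the sign flip explicit.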

Hear Your Face: Face-based voice conversion with F0 estimation

Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

Voice conversion using dynamic inter-frame features

Noise-robust voice conversion using adversarial training with multi-feature decoupling

Real-Time and Accurate: Zero-shot High-Fidelity Singing Voice Conversion with Multi-Condition Flow Synthesis

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

Singing voice conversion with non-parallel data

Pitch Transformation in Neural Network Based Voice Conversion

A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

HiFi-SVC: Fast High Fidelity Cross-Domain Singing Voice Conversion

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer

Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding

Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices

Controlled AutoEncoders to Generate Faces from Voices

Voice Conversion With Just Nearest Neighbors

Phone-aware LSTM-RNN for Voice Conversion