Abstract: The ideal goal of voice conversion is to convert the source speaker's speech so that it sounds naturally like the target speaker while maintaining the linguistic content and prosody of the source speech. However, current approaches fall short of comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfactory because of the mismatch between the acoustic model and the vocoder. In this paper, we leverage recent advances in information perturbation and propose a fully end-to-end approach to high-quality voice conversion. We first adopt information perturbation to remove speaker-related information from the source speech, disentangling speaker timbre from linguistic content; the linguistic information is then modeled by a content encoder. To better transfer the prosody of the source speech to the target, we introduce a speaker-related pitch encoder that maintains the general pitch pattern of the source speaker while flexibly modifying the pitch intensity of the generated speech. Finally, one-shot voice conversion is enabled through continuous speaker space modeling. Experimental results indicate that the proposed end-to-end approach significantly outperforms state-of-the-art models in terms of intelligibility, naturalness, and speaker similarity.

Voice conversion based on improved GMM and spectrum with synchronous prosody

An improved method for voice conversion based on Gaussian mixture model

An Improved Spectral and Prosodic Transformation Method in STRAIGHT-Based Voice Conversion

Voice conversion using dynamic inter-frame features

Voice Conversion with Smoothed GMM and MAP Adaptation

GMM-based Voice Conversion with Explicit Modelling on Feature Transform

Towards Fine-Grained Prosody Control for Voice Conversion

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

A Parametric Model for Voice Conversion

A hybrid method to convert acoustic features for voice conversion

Voice Conversion Based on Speaker Independent Model

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

A hybrid GMM and codebook mapping method for spectral conversion

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

Spectrum and Prosody Conversion for Cross-lingual Voice Conversion with CycleGAN

DBLSTM-Based Multi-Task Learning for Pitch Transformation in Voice Conversion

End-to-End Voice Conversion with Information Perturbation

Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature

PPG-based singing voice conversion with adversarial representation learning

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams