Abstract:Voice conversion is to generate a new speech with the source content and a target voice style. In this paper, we focus on one general setting, i.e., non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require the paired source and reference speeches and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the sophistication of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of mel-spectrogram, we propose a new voice conversion framework, i.e., Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts each subband content of the source speech separately by explicitly utilizing the spatial characteristics between different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. In particular, the style encoder network is designed to learn style codes for different subbands of the target speaker. The content encoder network can capture the content information on the source speech. Finally, the decoder generates particular subband content. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on VCTK Corpus and AISHELL3 datasets both qualitatively and quantitatively, whether on seen or unseen data. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.

Voice Conversion with High Naturalness Using Spectrum and Super-segmental Feature Transform

High-Quality Voice Conversion Using Spectrogram-Based Wavenet Vocoder

Voice conversion using dynamic inter-frame features

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer.

An improved method for voice conversion based on Gaussian mixture model

Pitch Transformation in Neural Network Based Voice Conversion

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

An Improved Spectral And Prosodic Transformation Method In Straight-Based Voice Conversion

A Parametric Model for Voice Conversion

A noise-robust voice conversion method with controllable background sounds

A Compact Framework For Voice Conversion Using Wavenet Conditioned On Phonetic Posteriorgrams

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

One-shot voice conversion using a combination of U2-Net and vector quantization

End-to-End Voice Conversion with Information Perturbation

A novel voice conversion system based on codebook mapping with phoneme-tied weighting.

Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion