Abstract: Voice conversion generates new speech that carries the content of a source utterance in a target voice style. In this paper, we focus on one general setting, non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the complexity of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts the content of each subband of the source speech separately, explicitly exploiting the spatial characteristics that distinguish different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. The style encoder learns style codes for the different subbands of the target speaker, the content encoder captures the content information of the source speech, and the decoder generates the content of each subband. In addition, we propose a pitch-shift module that fine-tunes the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on the VCTK Corpus and AISHELL3 datasets, both qualitatively and quantitatively, on seen and unseen data alike. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.
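
To make the subband idea concrete, the following is a minimal sketch of how a mel-spectrogram might be partitioned along the frequency axis before each band is processed separately. It is not the SGAN-VC implementation: the abstract does not say how many subbands are used or whether they are equal in size, so the equal-height contiguous split, the function names `split_into_subbands` and `merge_subbands`, and the toy 8-bin input are all assumptions made for illustration. No conversion is performed here; in the actual model each band would be generated by the decoder from the content features and that band's style code.

```python
from typing import List

# Rows are mel bins (low to high frequency), columns are time frames.
Spectrogram = List[List[float]]


def split_into_subbands(mel: Spectrogram, num_subbands: int) -> List[Spectrogram]:
    """Split a mel-spectrogram into contiguous, equal-height frequency subbands.

    Assumption for this sketch: the number of mel bins is divisible by
    num_subbands, so every band has the same height.
    """
    num_bins = len(mel)
    if num_subbands <= 0 or num_bins % num_subbands != 0:
        raise ValueError("number of mel bins must be divisible by num_subbands")
    band_height = num_bins // num_subbands
    return [mel[i * band_height:(i + 1) * band_height] for i in range(num_subbands)]


def merge_subbands(subbands: List[Spectrogram]) -> Spectrogram:
    """Stack subbands back along the frequency axis (inverse of the split)."""
    return [row for band in subbands for row in band]


# Toy input: 8 mel bins x 3 frames; each value encodes its own bin index.
mel = [[float(b)] * 3 for b in range(8)]
subbands = split_into_subbands(mel, num_subbands=4)
restored = merge_subbands(subbands)
```

Splitting and then merging without any per-band processing returns the original spectrogram, which is a convenient sanity check before a per-band model is inserted between the two steps.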

Voice Conversion Using Deep Neural Networks with Layer-Wise Generative Training

Voice Conversion Using Generative Trained Deep Neural Networks with Multiple Frame Spectral Envelopes

Spectral Conversion Using Deep Neural Networks Trained with Multi-Source Speakers

Using bidirectional associative memories for joint spectral envelope modeling in voice conversion

Joint Spectral Distribution Modeling Using Restricted Boltzmann Machines For Voice Conversion

Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

GTDNN-Based Voice Conversion Using DAEs with Binary Distributed Hidden Units

GMM-based Voice Conversion with Explicit Modelling on Feature Transform

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Voice Conversion Based on Gaussian Mixture Modules with Minimum Distance Spectral Mapping

Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Restoring High Frequency Spectral Envelopes Using Neural Networks For Speech Bandwidth Extension

The USTC System for Voice Conversion Challenge 2016: Neural Network Based Approaches for Spectrum, Aperiodicity and F0 Conversion

Spectro-Temporal Modelling with Time-Frequency LSTM and Structured Output Layer for Voice Conversion

Voice Conversion Using Conditional Restricted Boltzmann Machine

VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics