Abstract: Voice conversion aims to generate new speech that carries the source content in a target voice style. In this paper, we focus on one general setting, non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, because voice conversion is a complex task, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts the content of each subband of the source speech separately by explicitly utilizing the spatial characteristics between different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. In particular, the style encoder network is designed to learn style codes for different subbands of the target speaker. The content encoder network captures the content information of the source speech. Finally, the decoder generates the content of each particular subband. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on the VCTK Corpus and AISHELL3 datasets both qualitatively and quantitatively, on both seen and unseen data. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.
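
The subband idea in the abstract can be illustrated with a toy sketch. This is not the SGAN-VC implementation: the function names `split_subbands` and `convert_subbands`, the equal-width split along the mel axis, and the use of a simple (scale, shift) pair as a stand-in for a learned per-subband style code are all assumptions made for illustration. The sketch only shows how a mel-spectrogram might be divided along the frequency axis, processed one subband at a time, and reassembled.

```python
import numpy as np


def split_subbands(mel, num_subbands):
    """Split a (n_mels, n_frames) mel-spectrogram into equal frequency subbands.

    Hypothetical helper: the equal-width split is an assumption, not
    something the abstract specifies.
    """
    n_mels = mel.shape[0]
    if n_mels % num_subbands != 0:
        raise ValueError("n_mels must be divisible by num_subbands")
    return np.split(mel, num_subbands, axis=0)


def convert_subbands(mel, style_codes):
    """Toy per-subband conversion.

    Each subband gets its own (scale, shift) pair, standing in for a
    learned style code. A real system would use encoder and decoder
    networks here instead of an affine map.
    """
    subbands = split_subbands(mel, len(style_codes))
    converted = [
        scale * band + shift
        for band, (scale, shift) in zip(subbands, style_codes)
    ]
    # Reassemble the converted subbands along the frequency axis.
    return np.concatenate(converted, axis=0)


# Random stand-in for an 80-bin, 100-frame mel-spectrogram.
mel = np.random.default_rng(0).standard_normal((80, 100))
# One hypothetical style code per subband (4 subbands of 20 bins each).
style_codes = [(1.0, 0.0), (0.9, 0.1), (1.1, -0.1), (1.0, 0.05)]
out = convert_subbands(mel, style_codes)
```

The output keeps the input shape, and each block of 20 mel bins is transformed independently of the others, which is the property the subband formulation relies on.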

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

High Fidelity Speech Synthesis with Adversarial Networks

Avocodo: Generative Adversarial Network for Artifact-free Vocoder

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

Wasserstein GAN and Waveform Loss-based Acoustic Model Training for Multi-speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

Very Low Complexity Speech Synthesis Using Framewise Autoregressive GAN (FARGAN) with Pitch Prediction

GANSynth: Adversarial Neural Audio Synthesis

An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion

StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

Parallel Synthesis for Autoregressive Speech Generation

WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks