Abstract: Voice conversion aims to generate new speech that keeps the content of the source speech while adopting the voice style of a target speaker. In this paper, we focus on one general setting, non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the complexity of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts the content of each subband of the source speech separately, explicitly utilizing the spatial characteristics between different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. The style encoder network is designed to learn style codes for different subbands of the target speaker. The content encoder network captures the content information of the source speech. Finally, the decoder generates the content of each particular subband. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on the VCTK Corpus and AISHELL3 datasets, both qualitatively and quantitatively, on seen as well as unseen data. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC trained with the assistance of an ASR network.
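The subband idea described in the abstract can be illustrated with a minimal sketch. It is not the SGAN-VC implementation: the function names (`split_subbands`, `convert_subband`, `convert`), the use of four equal-height subbands, and the normalize-then-scale-and-shift step standing in for the learned style encoder, content encoder, and decoder are all assumptions made for illustration. The sketch only shows how a mel-spectrogram might be split along the frequency axis, converted per subband with a separate style code, and reassembled.

```python
import numpy as np

def split_subbands(mel, num_subbands):
    """Split a (n_mels, frames) mel-spectrogram into equal-height frequency subbands."""
    n_mels = mel.shape[0]
    if n_mels % num_subbands != 0:
        raise ValueError("n_mels must be divisible by num_subbands")
    return np.split(mel, num_subbands, axis=0)

def convert_subband(content, style_scale, style_shift, eps=1e-5):
    """Hypothetical stand-in for the learned per-subband conversion:
    normalize the subband, then apply a style-dependent scale and shift."""
    mean = content.mean()
    std = content.std()
    return style_scale * (content - mean) / (std + eps) + style_shift

def convert(mel, style_codes):
    """Convert each subband with its own (scale, shift) style code and re-stack them."""
    bands = split_subbands(mel, len(style_codes))
    converted = [convert_subband(band, scale, shift)
                 for band, (scale, shift) in zip(bands, style_codes)]
    return np.concatenate(converted, axis=0)

# Toy input: 80 mel bins, 100 frames (random values, not real speech).
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 100))

# One assumed (scale, shift) style code per subband; a real model would learn these.
style_codes = [(1.0, 0.0), (0.5, -1.0), (2.0, 1.0), (1.5, 0.5)]
converted = convert(mel, style_codes)
```

In this toy setup each subband of the output carries the statistics dictated by its own style code, which is the sense in which the subbands are converted separately; the actual framework replaces the fixed scale and shift with codes produced by the style encoder.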

Any-to-Any Voice Conversion With Multi-Layer Speaker Adaptation and Content Supervision

Voice Conversion towards Arbitrary Speakers With Limited Data

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

MediumVC: Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

S2VC - A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations

Improving Recognition-Synthesis Based Any-to-one Voice Conversion with Cyclic Training

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

One-Shot Voice Conversion with Global Speaker Embeddings

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

Innovative Speaker-Adaptive Style Transfer VAE-WadaIN for Enhanced Voice Conversion in Intelligent Speech Processing

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

Highly Controllable Diffusion-based Any-to-Any Voice Conversion Model with Frame-level Prosody Feature

One-shot Voice Conversion For Style Transfer Based On Speaker Adaptation

Residual Speaker Representation for One-Shot Voice Conversion

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts