Abstract: Voice conversion generates new speech that carries the content of a source utterance in the voice style of a target speaker. In this paper, we focus on one general setting, non-parallel many-to-many voice conversion, which is close to the real-world scenario. As the name implies, non-parallel many-to-many voice conversion does not require paired source and reference speech and can be applied to arbitrary voice transfer. In recent years, Generative Adversarial Networks (GANs) and other techniques such as Conditional Variational Autoencoders (CVAEs) have made considerable progress in this field. However, due to the complexity of voice conversion, the style similarity of the converted speech is still unsatisfactory. Inspired by the inherent structure of the mel-spectrogram, we propose a new voice conversion framework, the Subband-based Generative Adversarial Network for Voice Conversion (SGAN-VC). SGAN-VC converts the content of each subband of the source speech separately, explicitly exploiting the spatial characteristics of the different subbands. SGAN-VC contains one style encoder, one content encoder, and one decoder. The style encoder learns style codes for the different subbands of the target speaker, the content encoder captures the content information of the source speech, and the decoder generates the content of each subband. In addition, we propose a pitch-shift module to fine-tune the pitch of the source speaker, making the converted tone more accurate and explainable. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art performance on the VCTK Corpus and AISHELL3 datasets, both qualitatively and quantitatively, on seen and unseen data alike. Furthermore, the content intelligibility of SGAN-VC on unseen data even exceeds that of StarGANv2-VC with ASR network assistance.
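
The subband idea in the abstract can be illustrated with a small NumPy sketch. This is not the SGAN-VC implementation: the function names `split_subbands` and `convert_subbands`, the equal-width split, the four bands, and the normalise-then-rescale step standing in for the learned decoder are all assumptions made for illustration. It only shows the general pattern of splitting a mel-spectrogram along the frequency axis and conditioning each band on its own style code.

```python
import numpy as np

def split_subbands(mel, n_subbands):
    """Split a (n_mels, n_frames) mel-spectrogram into equal-width frequency bands."""
    n_mels = mel.shape[0]
    if n_mels % n_subbands != 0:
        raise ValueError("n_mels must be divisible by n_subbands")
    return np.split(mel, n_subbands, axis=0)

def convert_subbands(mel, style_codes):
    """Toy stand-in for a decoder (assumption, not the paper's network).

    Each band is normalised to zero mean and unit variance, then rescaled
    with that band's own hypothetical (scale, shift) style code.
    """
    bands = split_subbands(mel, len(style_codes))
    converted = []
    for band, (scale, shift) in zip(bands, style_codes):
        normed = (band - band.mean()) / (band.std() + 1e-8)
        converted.append(scale * normed + shift)
    # Stack the bands back along the frequency axis.
    return np.concatenate(converted, axis=0)

# Random data in place of a real mel-spectrogram: 80 mel bins, 120 frames.
rng = np.random.default_rng(0)
mel = rng.normal(size=(80, 120))

# One hypothetical (scale, shift) pair per subband.
style_codes = [(1.0, 0.0), (0.5, -1.0), (2.0, 0.3), (1.5, 1.0)]
converted = convert_subbands(mel, style_codes)
```

In this sketch each band ends up with the statistics dictated by its own style code, independently of the other bands, which is the property the per-subband style codes are meant to provide; in a real model the fixed pairs would be replaced by the outputs of a trained style encoder.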

Pathological Voice Feature Generation Using Generative Adversarial Network

Pvd: A New Pathological Voice Dataset For Intra-Speaker Recognition Research Interest

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Nonlinear dynamic analysis of pathological voices

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

SingGAN: Generative Adversarial Network for High-Fidelity Singing Voice Generation

Targeted Speech Adversarial Example Generation With Generative Adversarial Network

DLT-GAN: Dual-Layer Transfer Generative Adversarial Network-Based Time Series Data Augmentation Method

Combined Generative Adversarial Network and Fuzzy C-Means Clustering for Multi-Class Voice Disorder Detection with an Imbalanced Dataset

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

GBNF-VAE: A Pathological Voice Enhancement Model Based on Gold Section for Bottleneck Feature With Variational Autoencoder

Statistical parametric speech synthesis using generative adversarial networks under a multi-task learning framework

Pathological Voice Feature Selection Based on Neural Network

Improving Pathological Voice Detection: A Weakly Supervised Learning Method

Pathological voice adaptation with autoencoder-based voice conversion

A 12-year-old boy with fever and blue ears.

Electroencephalographic Signal Data Augmentation Based on Improved Generative Adversarial Network

Study of GANs for Noisy Speech Simulation from Clean Speech

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

Pathological voice detection using optimized deep residual neural network and explainable artificial intelligence