Abstract: Voice conversion transforms speaking style while preserving linguistic information. Many researchers use deep generative models for voice conversion tasks. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. Samples generated by Denoising Diffusion Probabilistic Models (DDPMs) are better than those of GANs in terms of mode coverage and sample diversity, but DDPMs have high computational costs and slower inference than GANs. To make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, to achieve non-parallel many-to-many voice conversion (VC). We use large steps for denoising and introduce a multimodal conditional GAN to model the denoising diffusion generative adversarial network. Both objective and subjective evaluation experiments show that DiffGAN-VC achieves high voice quality on non-parallel datasets. Compared with the CycleGAN-VC method, DiffGAN-VC achieves good speaker similarity and naturalness and higher sound quality.

What problem does this paper attempt to address?

This paper aims to achieve high-quality many-to-many voice conversion (VC) on non-parallel datasets. Specifically, the authors focus on improving the quality and diversity of voice conversion by improving generative adversarial networks (GANs) and denoising diffusion probabilistic models (DDPMs) when no parallel corpora are available. Traditional voice conversion techniques usually rely on parallel datasets, in which the source speaker and the target speaker record the same content. In practice, however, collecting a large amount of parallel data is time-consuming and difficult, so voice conversion methods that work on non-parallel data are of real practical value.

To overcome the limitations of existing methods on non-parallel datasets, the authors propose a new method named DiffGAN-VC. It combines multimodal conditional GANs with reconstructed denoising diffusion probabilistic models, aiming to improve the model's efficiency and performance while reducing computational complexity. DiffGAN-VC uses large-step denoising operations to reduce the total number of denoising steps and the amount of computation, which speeds up inference. Because a single large denoising step can no longer be described well by a simple unimodal distribution, the method parameterizes the denoising distribution with a multimodal distribution, which is what makes large-step denoising workable.

The paper evaluates DiffGAN-VC through objective and subjective experiments. The results show that, compared with several existing methods (such as VQVC, StarGAN-VC2, CycleGAN-VC2 and DDPM), DiffGAN-VC significantly improves feature quality and conversion quality. In particular, DiffGAN-VC performs better than CycleGAN-VC on non-parallel datasets. It performs well in terms of naturalness and speaker similarity, and it also achieves good results in multi-domain non-parallel voice conversion tasks.
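The large-step denoising idea summarized above can be illustrated with a minimal sketch. It follows the general denoising-diffusion-GAN recipe (a conditional generator predicts the clean sample from a noisy one and a random latent, then the next, less noisy sample is drawn from the Gaussian DDPM posterior), which is an assumption here and is not confirmed as the exact DiffGAN-VC procedure. The schedule values, the function names, and `toy_generator` are hypothetical placeholders for illustration only.

```python
import numpy as np

def make_schedule(num_steps, beta_min=0.1, beta_max=0.9):
    """Linear noise schedule with a few large steps (values are illustrative)."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def posterior_sample(x_t, x0_pred, t, betas, alphas, alpha_bars, rng):
    """Draw x_{t-1} from the Gaussian posterior q(x_{t-1} | x_t, x0_pred)."""
    if t == 0:
        # At the last step the posterior collapses onto the predicted clean sample.
        return x0_pred
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)
    var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0_pred + coef_xt * x_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def sample(generator, shape, speaker_id, num_steps=4, seed=0):
    """Run a few large denoising steps, one generator call per step."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(num_steps)):
        # A fresh latent z lets the generator represent a multimodal
        # denoising distribution instead of a single Gaussian mean.
        z = rng.standard_normal(shape)
        x0_pred = generator(x, t, z, speaker_id)
        x = posterior_sample(x, x0_pred, t, betas, alphas, alpha_bars, rng)
    return x

def toy_generator(x_t, t, z, speaker_id):
    """Hypothetical stand-in for a trained conditional generator."""
    return 0.5 * x_t + 0.1 * z + float(speaker_id)

# Example: an 80-bin feature map with 16 frames, converted in 4 steps.
converted = sample(toy_generator, shape=(80, 16), speaker_id=1, num_steps=4)
```

The point of the sketch is the cost profile: a few generator calls (`num_steps=4` here) instead of the long step sequence a standard DDPM typically uses. That matches the reduction in total denoising steps described above.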

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion

NVCGAN: Leveraging Generative Adversarial Networks for Robust Voice Conversion

CycleGAN-VC-GP: Improved CycleGAN-based Non-parallel Voice Conversion

Subband-based Generative Adversarial Network for Non-parallel Many-to-many Voice Conversion

FastDiff 2: Revisiting and Incorporating GANs and Diffusion Models in High-Fidelity Speech Synthesis

CVC: Contrastive Learning for Non-parallel Voice Conversion

Many-to-Many Voice Conversion using Conditional Cycle-Consistent Adversarial Networks

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

Multi-target Voice Conversion Without Parallel Data by Adversarially Learning Disentangled Audio Representations

Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios

Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion

CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

Voice Conversion Using Deep Neural Networks with Layer-Wise Generative Training

FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

Boosting Star-GANs for Voice Conversion with Contrastive Discriminator

GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus

Emotional Voice Conversion With Cycle-consistent Adversarial Network

Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks