Abstract: Voice conversion (VC) has emerged as a significant research area in speech synthesis in recent years, owing to its growing applications in voice-assistive technologies such as automated movie dubbing and speech-to-singing conversion. VC converts the vocal style of one speaker to that of another while keeping the linguistic content unchanged. Generative adversarial network (GAN) models are now widely used to map speech features from the source speaker to the target speaker. In this article, we propose an adaptive-learning-based GAN model, called ALGAN-VC, to improve one-to-one VC between speakers. The ALGAN-VC framework combines several approaches to improve speech quality and the voice similarity between the converted speech and the target speaker. We incorporate a dense residual network architecture into the generator network for efficient speech feature learning between source and target speakers. Our framework also includes an adaptive learning mechanism to compute the loss function for the proposed model. Moreover, a boosted learning rate approach is incorporated to enhance the learning capability of the proposed model. The proposed model is tested on the Voice Conversion Challenge 2016, 2018, and 2020 datasets, along with our self-prepared Indian regional-language speech dataset. In addition, an emotional speech dataset is used to evaluate the model's performance. The objective and subjective evaluations of the generated speech samples indicate that the proposed model performs the voice conversion task well, achieving high speaker similarity and good speech quality.

AdaptVC: High Quality Voice Conversion with Adaptive Learning

Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

One-shot voice conversion based on speaker aware module

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

Voice Conversion towards Arbitrary Speakers With Limited Data

Zero-shot voice conversion based on feature disentanglement

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Again-VC: A One-Shot Voice Conversion Using Activation Guidance and Adaptive Instance Normalization

AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion

Non-parallel Sequence-to-Sequence Voice Conversion for Arbitrary Speakers

SelfVC: Voice Conversion With Iterative Refinement using Self Transformations

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts

Disentangling Content and Fine-Grained Prosody Information Via Hybrid ASR Bottleneck Features for Voice Conversion

Towards General-Purpose Text-Instruction-Guided Voice Conversion

Multi-target Voice Conversion Without Parallel Data by Adversarially Learning Disentangled Audio Representations

Iteratively Improving Speech Recognition and Voice Conversion

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion

Zero-Shot Voice Conversion with Adjusted Speaker Embeddings and Simple Acoustic Features

An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion

Duration Controllable Voice Conversion via Phoneme-Based Information Bottleneck