Abstract: Voice cloning, also known as personalized voice synthesis, is an important branch of speech synthesis that aims to synthesize speech with the vocal characteristics of a target speaker. The field faces two main challenges: (1) typically only a small amount of speech data from the target speaker is available, so the speaker’s timbre must be cloned from limited data; and (2) the target speaker’s speech is often recorded with non-professional equipment in noisy environments, which places heavy demands on the robustness of voice cloning systems. In this paper, we propose a noise-robust voice adaptation method based on speaker-independent bottleneck features. The method consists of three steps: (1) a disentanglement module based on an autoencoder architecture separates timbre information from the speech data, yielding speaker-independent bottleneck features that retain environmental information; (2) with the help of this module, the Text2BN (Text to Bottleneck) module is trained on high-quality speech data to learn a mapping from text to clean speaker-independent bottleneck features; and (3) the decoder is fine-tuned on noisy speech data from the target speaker to adapt to that speaker, and is then cascaded with the Text2BN module to synthesize clean audio. The disentanglement module requires neither text transcriptions nor artificially paired clean/noisy datasets, so it can be pre-trained at scale on massive real-world, untranscribed, noisy datasets, further improving the model’s noise robustness and its ability to disentangle timbre. During the cloning phase, the method places no requirements on the recording conditions of the target speaker’s speech data and needs no text transcriptions, which matches real application scenarios more closely. Because our model does not operate directly on the noise level, it avoids the problem of insufficient robustness to out-of-distribution noise. Experiments demonstrate that the method can effectively use noisy target-speaker speech data for voice cloning, with advantages in both the quality and the similarity of the synthesized speech.
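
The data flow of the three steps can be illustrated with a toy sketch. It is not the paper's implementation: the neural modules are replaced by simple arithmetic stand-ins, and every function name here (`extract_timbre`, `extract_bn`, `text2bn`, `decode`, `clone_voice`) is hypothetical. The sketch assumes that timbre behaves like a constant per-speaker offset, which is a simplification chosen only to make the separation visible.

```python
from typing import List

Frame = List[float]  # one feature frame (toy stand-in for an acoustic frame)


def extract_timbre(frames: List[Frame]) -> Frame:
    """Toy timbre embedding: per-dimension mean over all frames (assumption)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def extract_bn(frames: List[Frame]) -> List[Frame]:
    """Toy speaker-independent bottleneck (BN) features.

    The timbre offset is removed; frame-level environmental variation
    stays in the BN features, mirroring step 1.
    """
    timbre = extract_timbre(frames)
    return [[x - t for x, t in zip(frame, timbre)] for frame in frames]


def text2bn(text: str, dim: int = 2) -> List[Frame]:
    """Toy Text2BN (step 2): one deterministic 'clean' BN frame per character."""
    return [[(ord(ch) % 7) / 7.0] * dim for ch in text]


def decode(bn: List[Frame], timbre: Frame) -> List[Frame]:
    """Toy decoder: recombine BN features with a speaker's timbre."""
    return [[b + t for b, t in zip(frame, timbre)] for frame in bn]


def clone_voice(text: str, noisy_target_frames: List[Frame]) -> List[Frame]:
    """Step 3: adapt to the target speaker, then cascade Text2BN and decoder.

    Estimating the timbre stands in for fine-tuning the decoder. No
    transcription of the target speaker's data is used.
    """
    timbre = extract_timbre(noisy_target_frames)
    clean_bn = text2bn(text, dim=len(timbre))
    return decode(clean_bn, timbre)


target = [[1.0, 2.0], [3.0, 4.0]]       # untranscribed target-speaker frames
cloned = clone_voice("ab", target)      # one output frame per input character
```

In this toy setting the output carries the target speaker's offset, while its frame-level content comes only from the clean BN features produced by `text2bn`. That is the property the cascade in step 3 relies on.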

NRAdapt: Noise-Robust Adaptive Text to Speech Using Untranscribed Data

DENOISPEECH: DENOISING TEXT TO SPEECH WITH FRAME-LEVEL NOISE MODELING

Data Efficient Voice Cloning for Neural Singing Synthesis

A real-time voice cloning system with multiple algorithms for speech quality improvement

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language

Adapting TTS models For New Speakers using Transfer Learning

Personalized Lightweight Text-to-Speech: Voice Cloning with Adaptive Structured Pruning

Multi-modal Adversarial Training for Zero-Shot Voice Cloning

OpenVoice: Versatile Instant Voice Cloning

Deep Voice: Real-time Neural Text-to-Speech

AdaSpeech: Adaptive Text to Speech for Custom Voice

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement

Adaspeech 2: Adaptive Text to Speech with Untranscribed Data

High quality, lightweight and adaptable TTS using LPCNet

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Neural Fusion for Voice Cloning

A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

DIAN: DURATION INFORMED AUTO-REGRESSIVE NETWORK FOR VOICE CLONING

Zero-Shot Voice Cloning Text-to-Speech for Dysphonia Disorder Speakers