Abstract: Voice cloning, also known as personalized voice synthesis, is a significant branch of speech synthesis. It aims to synthesize speech with the same vocal characteristics as a target speaker. The field faces two main challenges: (1) typically only a small amount of voice data from the target speaker is available, so the speaker's timbre must be cloned from limited data; (2) the target speaker's voice data is often recorded with non-professional equipment in noisy environments, which poses a significant challenge to the robustness of voice cloning systems. In this paper, we propose a noise-robust voice adaptation method based on speaker-independent bottleneck features. The method consists of three steps: (1) a disentanglement module based on an autoencoder architecture separates the timbre information from the speech data, yielding speaker-independent bottleneck features that retain environmental information; (2) with the help of this module, the Text2BN (Text to Bottleneck) module is trained on high-quality voice data to establish a mapping from text to clean speaker-independent bottleneck features; (3) the decoder is fine-tuned on noisy speech data from the target speaker to adapt to that speaker, and is then cascaded with the Text2BN module to synthesize clean audio. The disentanglement module requires neither text transcriptions nor artificially paired clean/noisy datasets, which enables large-scale pre-training on massive real-world, untranscribed, noisy datasets to further improve the model's noise robustness and its ability to decouple timbre. During the cloning phase, the method places no requirements on the recording conditions of the target speaker's speech data and needs no text transcriptions, which aligns more closely with real application scenarios. Because our model does not operate directly on the noise itself, it avoids the problem of insufficient robustness to out-of-distribution noise. Experiments demonstrate that the method can effectively use noisy target-speaker speech data for voice cloning, achieving advantages in both speech quality and speaker similarity in the synthesized speech.
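The three training steps and the cascaded inference path described above can be summarized as a small data-flow sketch. This is only an illustration of which component is updated at each step and what data each step needs, not the authors' implementation. The names `bn_encoder`, `Stage`, `usable_stages`, and `synthesize` are hypothetical, and the assumption that the bottleneck encoder stays frozen in steps 2 and 3 is ours; the abstract states only that Text2BN is trained "with the help of" the disentanglement module and that the decoder is fine-tuned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training step of the pipeline (all names are hypothetical)."""
    name: str
    trainable: tuple         # components updated in this step
    frozen: tuple            # components used but not updated (our assumption)
    needs_text: bool         # does the step require transcriptions?
    needs_clean_audio: bool  # does the step require high-quality recordings?


STAGES = (
    # Step 1: autoencoder-style disentanglement, pre-trained on
    # untranscribed, possibly noisy speech.
    Stage("pretrain_disentanglement", ("bn_encoder", "decoder"), (), False, False),
    # Step 2: Text2BN maps text to clean bottleneck features, using
    # high-quality transcribed speech.
    Stage("train_text2bn", ("text2bn",), ("bn_encoder",), True, True),
    # Step 3: only the decoder is fine-tuned on noisy, untranscribed
    # speech from the target speaker.
    Stage("adapt_decoder", ("decoder",), ("bn_encoder",), False, False),
)


def usable_stages(has_text: bool, is_clean: bool) -> list:
    """Return the names of the steps a dataset with these properties can serve."""
    return [
        s.name
        for s in STAGES
        if (has_text or not s.needs_text) and (is_clean or not s.needs_clean_audio)
    ]


def synthesize(text, text2bn, decoder):
    """Inference: text -> clean bottleneck features -> target-speaker audio."""
    return decoder(text2bn(text))
```

For example, `usable_stages(has_text=False, is_clean=False)` returns only the pre-training and decoder-adaptation steps, reflecting the claim that noisy, untranscribed recordings are sufficient for everything except Text2BN training. At inference, `synthesize` takes any two callables standing in for the trained Text2BN module and the adapted decoder.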
