Abstract: Voice cloning, also known as personalized voice synthesis, is a significant branch of speech synthesis. It aims to synthesize speech with the same vocal characteristics as a target speaker. The field faces two main challenges: (1) typically only a small amount of voice data from the target speaker is available, so the speaker’s timbre must be cloned from limited data; (2) the target speaker’s voice data is often recorded with non-professional equipment in noisy environments, which places heavy demands on the robustness of voice cloning systems. In this paper, we propose a noise-robust voice adaptation method based on speaker-independent bottleneck features. The method consists of three steps: (1) a disentanglement module based on an autoencoder architecture separates the timbre information from the speech data, yielding speaker-independent bottleneck features that still carry environmental information; (2) with the help of this module, the Text2BN (Text to Bottleneck) module is trained on high-quality voice data to learn a mapping from text to clean speaker-independent bottleneck features; (3) the decoder is fine-tuned on noisy speech data from the target speaker to adapt to that speaker, and is then cascaded with the Text2BN module to synthesize clean audio. The disentanglement module requires neither text transcriptions nor artificially paired clean/noisy datasets, so it can be pre-trained at scale on massive real-world, untranscribed, noisy datasets to further improve the model’s noise robustness and its ability to decouple timbre. During the cloning phase, the method places no requirements on the recording conditions of the target speaker’s speech data and needs no text transcriptions, which matches real application scenarios more closely. Because the model does not operate directly on the noise itself, it avoids the problem of insufficient robustness to out-of-distribution noise. Experiments demonstrate that the method can effectively use noisy target speaker speech data for voice cloning, achieving advantages in both the quality of the synthesized speech and its similarity to the target speaker.
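The three steps described in the abstract can be pictured with a toy data-flow sketch. It is only an illustration under stated assumptions, not the authors' implementation: every module is replaced by a hypothetical linear layer in NumPy, the feature sizes are invented, and the `speaker_encoder` used to condition the decoder is an assumed way of representing timbre that the abstract does not specify. Only the decoder is updated during adaptation, and synthesis cascades Text2BN with the adapted decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes, chosen only for illustration.
N_MEL, N_BN, N_SPK, N_TXT = 80, 16, 8, 32


class Linear:
    """Toy linear layer standing in for a real neural module."""

    def __init__(self, n_in, n_out):
        self.W = rng.normal(0.0, 0.1, (n_in, n_out))
        self.b = np.zeros(n_out)

    def __call__(self, x):
        return x @ self.W + self.b


# Step 1: disentanglement autoencoder (treated here as already pre-trained).
content_encoder = Linear(N_MEL, N_BN)   # frames -> speaker-independent bottleneck
speaker_encoder = Linear(N_MEL, N_SPK)  # assumption: utterance-level timbre vector
decoder = Linear(N_BN + N_SPK, N_MEL)   # bottleneck + timbre -> frames

# Step 2: Text2BN maps text features to clean bottleneck features
# (treated here as already trained on high-quality data).
text2bn = Linear(N_TXT, N_BN)


def decoder_input(bn, spk):
    """Concatenate per-frame bottleneck features with a tiled speaker vector."""
    return np.concatenate([bn, np.tile(spk, (bn.shape[0], 1))], axis=1)


def finetune_decoder_step(mel, lr=0.1):
    """Step 3: one gradient step on the decoder only, using target-speaker audio.

    No transcription is needed: the bottleneck comes from the audio itself.
    Returns the mean squared reconstruction error measured before the update.
    """
    bn = content_encoder(mel)
    spk = speaker_encoder(mel.mean(axis=0))
    x = decoder_input(bn, spk)
    err = decoder(x) - mel
    loss = float((err ** 2).mean())
    # Gradients of the mean squared error with respect to W and b.
    decoder.W -= lr * 2.0 * (x.T @ err) / err.size
    decoder.b -= lr * 2.0 * err.sum(axis=0) / err.size
    return loss


def synthesize(text_feats, target_mel):
    """Cascade Text2BN with the adapted decoder."""
    spk = speaker_encoder(target_mel.mean(axis=0))
    return decoder(decoder_input(text2bn(text_feats), spk))


# Random arrays stand in for noisy target-speaker frames and text features.
noisy_target_mel = rng.normal(size=(50, N_MEL))
losses = [finetune_decoder_step(noisy_target_mel) for _ in range(20)]
mel_out = synthesize(rng.normal(size=(12, N_TXT)), noisy_target_mel)
print(mel_out.shape, losses[0] > losses[-1])
```

In this picture the noise robustness comes from where the information flows rather than from any explicit denoising: the bottleneck extracted from noisy audio carries the environmental information, whereas at synthesis time the decoder receives bottleneck features from Text2BN, which was trained only on clean data.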

AdaSpeech: Adaptive Text to Speech for Custom Voice

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

AdaVocoder: Adaptive Vocoder for Custom Voice

AdaptiveFormer: A Few-shot Speaker Adaptive Speech Synthesis Model Based on FastSpeech2

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

AS-Speech: Adaptive Style for Speech Synthesis

ASRRL-TTS: Agile Speaker Representation Reinforcement Learning for Text-to-Speech Speaker Adaptation

High quality, lightweight and adaptable TTS using LPCNet

ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

NRAdapt: Noise-Robust Adaptive Text to Speech Using Untranscribed Data

Speaker Adaptation on Articulation and Acoustics for Articulation-to-Speech Synthesis

Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

Adapting TTS models For New Speakers using Transfer Learning