Abstract: Custom voice, a specific text-to-speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker using only a small amount of speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from the source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to the speaker embedding for adaptation. We pre-train the source TTS model on the LibriTTS dataset and fine-tune it on the VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with little adaptation data, e.g., 20 sentences, about 1 minute of speech. Experimental results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.
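
The conditional layer normalization mentioned in the abstract can be illustrated with a minimal sketch. It assumes the scale and bias are linear projections of a speaker embedding, which the abstract does not spell out. All names (`conditional_layer_norm`, `w_scale`, `w_bias`) and the toy dimensions are hypothetical, and this is not the authors' implementation.

```python
import math
import random

def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def conditional_layer_norm(hidden, speaker_emb, w_scale, w_bias, eps=1e-5):
    """Normalize one hidden vector, then scale and shift it with
    parameters derived from the speaker embedding (assumed layout)."""
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    scale = linear(w_scale, speaker_emb)  # one scale value per hidden unit
    bias = linear(w_bias, speaker_emb)    # one bias value per hidden unit
    return [g * n + b for g, n, b in zip(scale, normed, bias)]

def make_matrix(rows, cols, rng):
    """Small random matrix standing in for learned projection weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden_dim, speaker_dim = 8, 4  # toy sizes, not taken from the paper
w_scale = make_matrix(hidden_dim, speaker_dim, rng)
w_bias = make_matrix(hidden_dim, speaker_dim, rng)
speaker_emb = [rng.uniform(-1, 1) for _ in range(speaker_dim)]
hidden = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

out = conditional_layer_norm(hidden, speaker_emb, w_scale, w_bias)
print(len(out))
```

Under this assumption, only the small projections and the speaker embedding would need to be stored per speaker, which is consistent with the small per-speaker parameter count the abstract reports.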

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding

Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

DualSpeech: Enhancing Speaker-Fidelity and Text-Intelligibility Through Dual Classifier-Free Guidance

Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

Enhancing Multilingual Speech Recognition through Language Prompt Tuning and Frame-Level Language Adapter

HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Focusing on attention: prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation

AdaSpeech: Adaptive Text to Speech for Custom Voice

ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters

Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition

High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units

Efficient Adapter Tuning of Pre-trained Speech Models for Automatic Speaker Verification