Abstract: Custom voice, a specific text-to-speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker using only a small amount of speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from the source speech data, and 2) to support a large number of customers, the adaptation parameters for each target speaker need to be small enough to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the number of adaptation parameters against voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to the speaker embedding for adaptation. We pre-train the source TTS model on the LibriTTS dataset and fine-tune it on the VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with little adaptation data, e.g., 20 sentences, about 1 minute of speech. Experimental results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K speaker-specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.
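
The conditional layer normalization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the scale and bias are obtained, so this sketch assumes they are linear projections of the speaker embedding. The function names (`linear`, `make_proj`, `conditional_layer_norm`), the dimensions, and the random initialization are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def linear(weights, bias, x):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_proj(out_dim, in_dim, rng):
    """Hypothetical helper: a small random projection (W, b)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def conditional_layer_norm(hidden, speaker_emb, scale_proj, bias_proj, eps=1e-5):
    """Layer-normalize `hidden`, then scale and shift it with vectors
    predicted from the speaker embedding (assumed linear projections)."""
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    scale = linear(scale_proj[0], scale_proj[1], speaker_emb)
    shift = linear(bias_proj[0], bias_proj[1], speaker_emb)
    return [s * n + b for s, n, b in zip(scale, normed, shift)]


# Assumed toy sizes; real models use much larger dimensions.
HIDDEN_DIM, SPEAKER_DIM = 8, 4
rng = random.Random(0)

scale_proj = make_proj(HIDDEN_DIM, SPEAKER_DIM, rng)
bias_proj = make_proj(HIDDEN_DIM, SPEAKER_DIM, rng)
hidden = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
speaker_emb = [rng.gauss(0.0, 1.0) for _ in range(SPEAKER_DIM)]

output = conditional_layer_norm(hidden, speaker_emb, scale_proj, bias_proj)
```

Under this assumption, adapting to a new speaker only requires updating the speaker embedding and the small projections that produce the scale and shift, while the rest of the decoder stays shared across speakers.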

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

AdaSpeech: Adaptive Text to Speech for Custom Voice

ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation

HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

An investigation into the adaptability of a diffusion-based TTS model

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data

Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer

DiffVoice: Text-to-Speech with Latent Diffusion

UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

High quality, lightweight and adaptable TTS using LPCNet

DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

Efficient Adapter Finetuning for Tail Languages in Streaming Multilingual ASR

Adapting TTS models For New Speakers using Transfer Learning

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Exploiting Adapters for Cross-Lingual Low-Resource Speech Recognition