Abstract: Custom voice, a specific text-to-speech (TTS) service in commercial speech platforms, aims to adapt a source TTS model to synthesize a personal voice for a target speaker using little speech data. Custom voice presents two unique challenges for TTS adaptation: 1) to support diverse customers, the adaptation model needs to handle diverse acoustic conditions that could be very different from the source speech data, and 2) to support a large number of customers, the adaptation parameters need to be small enough for each target speaker to reduce memory usage while maintaining high voice quality. In this work, we propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices. We design several techniques in AdaSpeech to address the two challenges in custom voice: 1) To handle different acoustic conditions, we use two acoustic encoders to extract an utterance-level vector and a sequence of phoneme-level vectors from the target speech during training; in inference, we extract the utterance-level vector from a reference speech and use an acoustic predictor to predict the phoneme-level vectors. 2) To better trade off the adaptation parameters and voice quality, we introduce conditional layer normalization in the mel-spectrogram decoder of AdaSpeech, and fine-tune this part in addition to the speaker embedding for adaptation. We pre-train the source TTS model on the LibriTTS dataset and fine-tune it on the VCTK and LJSpeech datasets (with different acoustic conditions from LibriTTS) with little adaptation data, e.g., 20 sentences, about 1 minute of speech. Experimental results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K speaker-specific parameters for each speaker, which demonstrates its effectiveness for custom voice. Audio samples are available at https://speechresearch.github.io/adaspeech/.
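The conditional layer normalization mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the scale and bias of a layer normalization are produced by small linear projections of the speaker embedding. The function names, the `(weights, bias)` tuple layout, and the tiny dimensions are hypothetical and chosen for illustration; they are not taken from the AdaSpeech implementation.

```python
import math

def linear(weights, bias, x):
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def conditional_layer_norm(hidden, speaker_emb, scale_proj, bias_proj, eps=1e-5):
    """Normalize `hidden`, then apply a scale and bias predicted from `speaker_emb`.

    `scale_proj` and `bias_proj` are (weights, bias) tuples. In this sketch they
    are assumed to be the small set of parameters tuned per speaker, together
    with the speaker embedding itself.
    """
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    gamma = linear(*scale_proj, speaker_emb)  # speaker-dependent scale
    beta = linear(*bias_proj, speaker_emb)    # speaker-dependent bias
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

# Toy usage: a 4-dimensional hidden vector and a 3-dimensional speaker embedding.
hidden = [1.0, 2.0, 3.0, 4.0]
speaker_emb = [0.5, -0.5, 0.1]
zero_w = [[0.0] * 3 for _ in range(4)]
# With zero weights, scale bias 1 and shift bias 0, this reduces to plain layer norm.
plain = conditional_layer_norm(hidden, speaker_emb,
                               (zero_w, [1.0] * 4), (zero_w, [0.0] * 4))
```

Because only the projections and the speaker embedding depend on the speaker in this sketch, the per-speaker footprint stays small relative to the full decoder, which is the trade-off the abstract describes.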

Speaker Adaptive Text-to-Speech with Timbre-Normalized Vector-Quantized Feature

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge

Optimizing feature fusion for improved zero-shot adaptation in text-to-speech synthesis

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

AS-Speech: Adaptive Style for Speech Synthesis

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

AdaSpeech: Adaptive Text to Speech for Custom Voice

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Few-Shot Custom Speech Synthesis with Multi-Angle Fusion