Abstract: Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. Synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. Zero-shot and few-shot speaker-adaptive TTS approaches have been explored, but both have limitations. Zero-shot approaches tend to generalize too poorly to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they impose a significant storage burden and risk overfitting and catastrophic forgetting. In addition, prior approaches provide either zero-shot or few-shot adaptation but not both, which constrains their utility across real-world scenarios with different demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, neglecting the large population of non-native speakers with diverse accents. Our proposed framework unifies zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation, respectively, based on their merits. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce storage requirements in few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.

Duration optimization of speaker adaptation in Mandarin TTS

An unvoiced/voiced duration adjustment algorithm based on context features in Mandarin TTS

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation

Label Transform Based Cross-Language Speaker Adaptation in Bilingual (Mandarin-English) TTS

HMM Based TTS for Mixed Language Text.

Focusing on attention: prosody transfer and adaptative optimization strategy for multi-speaker end-to-end speech synthesis

Dynamic Speaker Representations Adjustment and Decoder Factorization for Speaker Adaptation in End-to-End Speech Synthesis

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Improving F0 prediction using bidirectional associative memories and syllable-level F0 features for HMM-based Mandarin speech synthesis

Duration Modeling of Neural TTS for Automatic Dubbing

Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

Mandarin-English Mixed TTS Based on HCSIPA

Spoken Content and Voice Factorization for Few-Shot Speaker Adaptation

Duration Model for post-processing in a Mandarin speech recognition system

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

Cross-Lingual Speaker Adaptation for HMM-Based Speech Synthesis

A Real-Time Tone Enhancement Method for Continuous Mandarin Speeches

Expressive, Variable, and Controllable Duration Modelling in TTS

Mandarin Speech Synthesis Based on Pitch Synchronous Time-Frequency Interpolation

USAT: A Universal Speaker-Adaptive Text-to-Speech Approach