Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when these models are applied to code-switched (CS) speech, performance degrades severely. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, the end-to-end (E2E) architecture has achieved satisfactory results in monolingual TTS; it allows a model to be trained directly from alphabetic text input at one end to acoustic feature output at the other. In this paper, we explore the use of the E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separate monolingual encoder (SPE) for each language. The two systems use an identical decoder architecture, in which a discriminative code is incorporated so that the model generates speech consistently in one speaker's voice. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of quality and speaker similarity of the generated speech. Moreover, our proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.

End-to-end Code-switched TTS with Mix of Monolingual Recordings.
