Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when these models are applied to synthesize code-switched (CS) speech, performance degrades severely. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, the end-to-end (E2E) architecture has achieved satisfactory results in monolingual TTS. This architecture enables training directly from alphabetic text input to acoustic feature output. In this paper, we explore the use of an E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora recorded by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separate monolingual encoder (SPE) for each language. The two systems use an identical decoder architecture, in which a discriminative code is incorporated to enable the model to generate speech consistently in one speaker's voice. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of the quality and speaker similarity of the generated speech. Moreover, our proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.
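The difference between the two encoder variants can be illustrated with a toy sketch of how language-tagged characters might be turned into encoder inputs. This is not the paper's implementation: the embedding sizes, the random lookup tables, the concatenation used for the LDE variant, and the helper names `make_table`, `lde_inputs`, and `spe_inputs` are all assumptions made for illustration, and the decoder and speaker code are omitted.

```python
import random

CHAR_DIM = 4  # assumed character-embedding size (illustrative only)
LANG_DIM = 2  # assumed language-embedding size (illustrative only)


def make_table(keys, dim, seed):
    """Build a toy embedding table of fixed pseudo-random vectors."""
    rng = random.Random(seed)
    return {key: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for key in keys}


def lde_inputs(tagged, char_table, lang_table):
    """LDE-style input: one shared character table, with a language
    embedding concatenated to every character vector (an assumption)."""
    return [char_table[ch] + lang_table[lang] for ch, lang in tagged]


def spe_inputs(tagged, tables_by_lang):
    """SPE-style input: each character is looked up in the table that
    belongs to its own language."""
    return [tables_by_lang[lang][ch] for ch, lang in tagged]


# Alphabetic input with a per-character language tag:
# pinyin "nihao" tagged as Mandarin, followed by English "ok".
tagged = [(c, "zh") for c in "nihao"] + [(c, "en") for c in "ok"]
chars = sorted({c for c, _ in tagged})

char_table = make_table(chars, CHAR_DIM, seed=0)
lang_table = make_table(["zh", "en"], LANG_DIM, seed=1)
tables_by_lang = {
    "zh": make_table(chars, CHAR_DIM, seed=2),
    "en": make_table(chars, CHAR_DIM, seed=3),
}

lde = lde_inputs(tagged, char_table, lang_table)
spe = spe_inputs(tagged, tables_by_lang)
```

In this sketch the letter "o" occurs once with a Mandarin tag and once with an English tag. Under the LDE-style input its character part is shared and only the language part differs, whereas under the SPE-style input the two occurrences receive unrelated vectors. Changing the tag of an individual character is one plausible way to think about character-level accent control, though how the paper realizes this is not stated in the abstract.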

A Novel Hybrid Mandarin Speech Synthesis System Using Different Base Units for Model Training and Concatenation

A Novel Hybrid Approach for Mandarin Speech Synthesis

A Hierarchical Viterbi Algorithm For Mandarin Hybrid Speech Synthesis System

A novel unit selection method for concatenation speech system using similarity measure

An Unified and Automatic Approach of Mandarin HTS System.

Syllable HMM Based Mandarin TTS and Comparison with Concatenative TTS.

A Novel Prosody Adaptation Method for Mandarin Concatenation-Based Text-to-speech System

Comparison of Syllable/Phone HMM Based Mandarin TTS

Hybrid Unit Model Based Non-uniform Unit Selection

Investigation of Modeling Units for Mandarin Speech Recognition Using Dfsmn-ctc-smbr

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

Mandarin-English Mixed TTS Based on HCSIPA

Hierarchical Non-Uniform Unit Selection Based on Prosodic Structure

Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

Statistical Acoustic Model Based Unit Selection Algorithm for Speech Synthesis

A data driven method for target and concatenation cost calculation with KL-Divergence in Mandarin hybrid speech synthesis

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

UnitNet-Based Hybrid Speech Synthesis

Improving Prosody with Linguistic and Bert Derived Features in Multi-Speaker Based Mandarin Chinese Neural TTS

Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech