Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when these models are applied to code-switched (CS) speech, their performance degrades severely. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, the end-to-end (E2E) architecture has achieved satisfactory results in monolingual TTS. This architecture allows a model to be trained directly from alphabetic text input at one end to acoustic feature output at the other. In this paper, we explore the use of an E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separate monolingual encoder (SPE) for each language. The two systems use an identical decoder architecture, in which a discriminative code is incorporated so that the model generates speech consistently in one speaker's voice. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of the quality and speaker similarity of the generated speech. Moreover, our proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.

End-to-end Code-switched TTS with Mix of Monolingual Recordings.
