Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, their performance degrades severely when they are applied to code-switched (CS) speech. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, end-to-end (E2E) architectures have achieved satisfactory results in monolingual TTS by training a single model that maps alphabetic text input directly to acoustic feature output. In this paper, we explore the use of the E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separate monolingual encoder (SPE) for each language. The two systems use an identical decoder architecture, in which a discriminative code is incorporated so that the model consistently generates speech in one speaker's voice. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of the quality and speaker similarity of the generated speech. Moreover, the proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.

A Chinese Text-to-Speech System

High quality Chinese text-to-speech system - BEYOND

Design and implementation of a speaker recognition system

A New Chinese Text-to-speech System with High Naturalness

Research and Implementation of Text-to-speech System for Chinese

A Study on KD-863 Chinese Text-To-Speech

Total Quality Evaluation of Speech Synthesis Systems.

A Preliminary Study on Deep Learning-based Chinese Text to Taiwanese Speech Synthesis System

DOP-Tacotron: a Fast Chinese TTS System with Local-based Attention

A Miniature Chinese TTS System Based on Tailored Corpus

A unified front-end framework for English text-to-speech synthesis

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

Text-to-speech Conversion Method of Dynamic Speech Alert System for Power Plant

Voice Style Cloning for Chinese Speech

Design and Implementation of Chinese TTS Engine: SmartTalk

Initial-Final Based Embedded Mandarin TTS System

Linguistic Problems in text-to-speech conversion Processing of the Chinese

Text-To-Visual Speech in Chinese Based on Data-Driven Approach

Objective Evaluation Methods for Chinese Text-To-Speech Systems

A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis

The WISTON Text to Speech System for Blizzard 2008