Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when these models are applied to synthesize code-switched (CS) speech, their performance degrades severely. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, the end-to-end (E2E) architecture has achieved satisfactory results in monolingual TTS. This architecture enables training directly from alphabetic text input at one end to acoustic feature output at the other. In this paper, we explore the use of the E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separate monolingual encoder (SPE) for each language. The two systems use an identical decoder architecture, in which a discriminative code is incorporated to enable the model to generate speech consistently in one speaker's voice. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of the quality and speaker similarity of the generated speech. Moreover, our proposed systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data.
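To make the contrast between the two encoder variants concrete, the following is a minimal sketch of how their inputs might be formed. It is not the paper's implementation: the real encoders are neural networks, whereas this sketch stands in plain embedding lookups for them. The dimensions, the function names, the use of concatenation to attach the language embedding, and the pinyin-with-tone-digits example input are all assumptions made for illustration.

```python
import random

EMB_DIM = 4   # hypothetical character-embedding size
LANG_DIM = 2  # hypothetical language-embedding size


def make_table(keys, dim, seed):
    """Toy embedding table: one fixed pseudo-random vector per distinct key."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for k in sorted(set(keys))}


def tag_characters(segments):
    """Flatten (text, language) segments into per-character (char, language) pairs."""
    return [(ch, lang) for text, lang in segments for ch in text]


def lde_inputs(tagged, char_table, lang_table):
    """LDE-style input: one shared character table for all languages, with a
    language embedding attached to every character (concatenation is assumed)."""
    return [char_table[ch] + lang_table[lang] for ch, lang in tagged]


def spe_inputs(tagged, tables):
    """SPE-style input: each character is looked up in the table that belongs
    to its own language, so no parameters are shared across languages."""
    return [tables[lang][ch] for ch, lang in tagged]


# Hypothetical code-switched input: Mandarin as pinyin with tone digits, then English.
segments = [("wo3 yong4 ", "zh"), ("python", "en")]
tagged = tag_characters(segments)

shared_table = make_table([ch for ch, _ in tagged], EMB_DIM, seed=0)
lang_table = make_table(["zh", "en"], LANG_DIM, seed=1)
per_lang_tables = {
    lang: make_table([ch for ch, l in tagged if l == lang], EMB_DIM, seed=i + 2)
    for i, lang in enumerate(["zh", "en"])
}

lde_out = lde_inputs(tagged, shared_table, lang_table)
spe_out = spe_inputs(tagged, per_lang_tables)

# Number of characters, LDE vector width, SPE vector width.
print(len(lde_out), len(lde_out[0]), len(spe_out[0]))
```

In this sketch the letter "o" in the Mandarin segment and the "o" in the English segment share the same character vector under the LDE variant and differ only in the attached language embedding, while under the SPE variant they come from entirely separate tables. Because the language tag is supplied per character, swapping the tag on individual characters is one plausible way to picture the character-level accent control the abstract mentions, though the abstract does not spell out the mechanism.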

End-to-end Code-switched TTS with Mix of Monolingual Recordings.
