Abstract: State-of-the-art text-to-speech (TTS) synthesis models can produce monolingual speech with high intelligibility and naturalness. However, when these models are used to synthesize code-switched (CS) speech, their performance degrades substantially. Conventionally, developing a CS TTS system requires multilingual data to incorporate language-specific and cross-lingual knowledge. Recently, end-to-end (E2E) architectures have achieved satisfactory results in monolingual TTS. Such an architecture can be trained directly from alphabetic text input at one end to acoustic feature output at the other. In this paper, we explore the use of the E2E framework for CS TTS, using a combination of Mandarin and English monolingual speech corpora uttered by two female speakers. To handle alphabetic input from different languages, we explore two kinds of encoders: (1) a shared multilingual encoder with explicit language embedding (LDE); (2) a separate monolingual encoder (SPE) for each language. The two systems use an identical decoder architecture, in which a discriminative code is incorporated so that the model generates speech consistently in one speaker's voice. Experiments confirm the effectiveness of the proposed modifications to the E2E TTS framework in terms of the quality and speaker similarity of the generated speech. Moreover, our proposed systems can generate foreign-accented speech that is controllable at the character level, using only a mixture of monolingual training data.
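
The difference between the two encoder input schemes can be illustrated with a toy sketch. It is not the paper's implementation: the embedding sizes, the random lookup tables, the `zh`/`en` language tags, and the helper names `make_table`, `lde_inputs`, and `spe_inputs` are all assumptions made for illustration. It only shows how a character that occurs in both languages is represented under each scheme before any encoder layers are applied.

```python
import random

EMB_DIM = 4    # toy symbol-embedding size (assumption)
LANG_DIM = 2   # toy language-embedding size (assumption)
LANGS = ("zh", "en")  # hypothetical language tags


def make_table(keys, dim, seed):
    """Build a toy lookup table of fixed random vectors."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}


def lde_inputs(tagged, symbol_table, lang_table):
    """Shared-encoder style: one symbol table for all languages, with an
    explicit language embedding concatenated to every symbol vector."""
    return [symbol_table[s] + lang_table[lang] for s, lang in tagged]


def spe_inputs(tagged, per_lang_tables):
    """Separate-encoder style: each symbol is looked up in the table that
    belongs to its own language."""
    return [per_lang_tables[lang][s] for s, lang in tagged]


# Hypothetical code-switched input: romanized Mandarin letters tagged "zh",
# English letters tagged "en". The letter "o" occurs in both languages.
tagged = [(c, "zh") for c in "wo"] + [(c, "en") for c in "ok"]
symbols = sorted({s for s, _ in tagged})

shared = make_table(symbols, EMB_DIM, seed=0)
lang_emb = make_table(LANGS, LANG_DIM, seed=1)
per_lang = {lang: make_table(symbols, EMB_DIM, seed=i + 2)
            for i, lang in enumerate(LANGS)}

lde = lde_inputs(tagged, shared, lang_emb)
spe = spe_inputs(tagged, per_lang)
```

In this sketch the two occurrences of "o" share the same symbol vector under the LDE-style scheme and differ only in the appended language part, whereas under the SPE-style scheme they come from entirely separate tables. Switching the language tag of individual characters is one way to picture the character-level accent control mentioned in the abstract, although the paper's actual mechanism may differ.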

Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

End-to-end Code-switched TTS with Mix of Monolingual Recordings.

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Extremely Low Footprint End-to-End ASR System for Smart Device

Accelerating Transducers through Adjacent Token Merging

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Efficient Decoding Self-Attention for End-to-end Speech Synthesis

Transformer-based End-to-End Speech Recognition with Local Dense Synthesizer Attention

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model.

Transformer-Transducers for Code-Switched Speech Recognition

Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

SETransformer: Speech Enhancement Transformer

Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

Extending Whisper with prompt tuning to target-speaker ASR

ESAformer: Enhanced Self-Attention for Automatic Speech Recognition

Improving Generalization of Transformer for Speech Recognition with Parallel Schedule Sampling and Relative Positional Embedding

Enhancing CTC-based speech recognition with diverse modeling units

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation