Abstract: Recently, two-pass end-to-end (E2E) automatic speech recognition (ASR) systems that pair a conformer model with a spelling correction backend have shown strong performance on general speech recognition tasks. However, these models may fail on code-switching (CS) speech, where a speaker alternates between words of two or more languages within a single sentence or across sentences. In this study, we propose a novel tri-stage training two-pass (TripleT) E2E framework that improves CS ASR performance by leveraging the individual attributes of each language. Our framework starts by introducing two symmetric language-specific encoders that are pre-trained using a large monolingual corpus. This improves the high-level acoustic representation of each individual language. Then, a bilingual acoustic learner (BAL) is proposed to combine these language-specific representations and transfer the monolingual acoustic attributes to code-switching properties. Next, these acoustic representations are further used to boost the spelling corrector through a context plus acoustic learner with the same structure as the BAL. Finally, the whole proposed framework is fine-tuned on the CS corpus to obtain the final CS E2E ASR system. Our experiments are performed on a mixed training dataset consisting of 1000 hours of Mandarin data, 960 hours of English data, and 555.9 hours of Mandarin-English code-switching data. ASR performance is evaluated on a 23.6-hour CS test set, and the results show that our proposed TripleT-E2E framework achieves a 13.4% relative reduction in token error rate compared to a competitive two-pass E2E baseline model.
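
For readers unfamiliar with the metric, the following is a minimal sketch of how a relative reduction in token error rate (TER) is usually computed. The abstract reports only the relative figure, so the absolute TER values below are hypothetical placeholders chosen for illustration, not results from the paper; the function name `relative_reduction` is likewise an assumption.

```python
def relative_reduction(baseline_ter: float, system_ter: float) -> float:
    """Return the relative error reduction, in percent, of system over baseline."""
    if baseline_ter <= 0:
        raise ValueError("baseline_ter must be positive")
    return 100.0 * (baseline_ter - system_ter) / baseline_ter


# Hypothetical absolute TERs (in percent); the abstract does not state them.
hypothetical_baseline_ter = 10.0
hypothetical_system_ter = 8.66

reduction = relative_reduction(hypothetical_baseline_ter, hypothetical_system_ter)
print(f"relative TER reduction: {reduction:.1f}%")
```

A relative reduction is measured against the baseline's own error rate, so the same 13.4% figure corresponds to different absolute gains depending on how strong the baseline is.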
