Abstract: Recently, two-pass end-to-end (E2E) automatic speech recognition (ASR) systems that pair a conformer model with a spelling correction backend have shown remarkable progress and strong performance on general speech recognition tasks. However, these models may fail on code-switching (CS) speech, where a speaker alternates between words of two or more languages within a single sentence or across sentences. In this study, we propose a novel tri-stage training two-pass (TripleT) E2E framework that improves CS ASR performance by leveraging the individual attributes of each language. The framework first introduces two symmetric language-specific encoders, each pre-trained on a large monolingual corpus, which improves the high-level acoustic representation of each individual language. Then, a bilingual acoustic learner (BAL) is proposed to combine these language-specific representations and transfer the monolingual acoustic attributes to code-switching properties. Next, these acoustic representations are further used to boost the spelling corrector through a context plus acoustic learner with the same structure as the BAL. Finally, the whole framework is fine-tuned on the CS corpus to obtain the final CS E2E ASR system. Our experiments are performed on a mixed training dataset consisting of 1000 hours of Mandarin data, 960 hours of English data, and 555.9 hours of Mandarin-English code-switching data. ASR performance is evaluated on a 23.6-hour CS test set, and the results show that the proposed TripleT-E2E framework achieves a 13.4% relative reduction in token error rate compared to a competitive two-pass E2E baseline model.
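To make the reported metric concrete, the following is a minimal sketch of how a token error rate and a relative reduction between two systems might be computed. It assumes, since the abstract does not define the token unit, that references and hypotheses are already split into tokens (for example, Mandarin characters and English words). The function names and all numeric values are hypothetical and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_error_rate(refs, hyps):
    """Total edit errors divided by total reference tokens."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline, proposed):
    """Relative error-rate reduction of `proposed` with respect to `baseline`."""
    return (baseline - proposed) / baseline


# Hypothetical code-switched utterance, pre-tokenized (assumed tokenization).
refs = [["我", "想", "听", "jazz", "music"]]
hyps = [["我", "想", "听", "just", "music"]]
ter = token_error_rate(refs, hyps)

# Hypothetical error rates chosen only to illustrate the arithmetic.
reduction = relative_reduction(0.10, 0.0866)
```

A relative reduction is measured against the baseline error rate rather than as an absolute difference, so the same absolute gain corresponds to a larger relative reduction when the baseline error rate is lower.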

Toward On-Line Learning of Chinese Continuous Speech Recognition System

Exploring RNN-Transducer for Chinese Speech Recognition

A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks

Towards Online Continuous Sign Language Recognition and Translation

Visual Information Assisted Mandarin Large Vocabulary Continuous Speech Recognition

Deep LSTM for Large Vocabulary Continuous Speech Recognition

Cantonese Automatic Speech Recognition Using Transfer Learning from Mandarin

Advances in Cantonese Speech Recognition: A Language-Specific Pretraining Model and RNN-T Loss

Improving the Syllable-Synchronous Network Search Algorithm for Word Decoding in Continuous Chinese Speech Recognition

Online Speaker Adaptation for LVCSR Based on Attention Mechanism

Towards Language-universal Mandarin-English Speech Recognition with Unsupervised Label Synchronous Adaptation

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

Mandarin Continuous Digit Speech Recognition System

Towards Language-Universal Mandarin-English Speech Recognition

Self-Attention Transducers for End-to-End Speech Recognition

Adaptive Speaker Normalization for CTC-Based Speech Recognition

Tri-stage training with language-specific encoder and bilingual acoustic learner for code-switching speech recognition

CNVSRC 2023: The First Chinese Continuous Visual Speech Recognition Challenge

Improving Online Incremental Speaker Adaptation with Eigen Feature Space MLLR

DLD: An Optimized Chinese Speech Recognition Model Based on Deep Learning

Improving RNN transducer with normalized jointer network