Abstract: Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), incorporating external language models (LMs) into decoding has been shown to further improve the recognition performance of E2E ASR systems. To match the modeling units adopted in E2E ASR systems, subword-level (e.g., character, BPE) LMs are usually used. However, subword-level LMs ignore word-level information, which may limit the benefit of the external LM in E2E ASR. Although several methods have been proposed to incorporate word-level external LMs in E2E ASR, they are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which a character sequence can correspond to multiple word sequences. To this end, we propose a novel decoding algorithm in which a word-level lattice is constructed on-the-fly to consider all possible word sequences for each partial hypothesis. The LM score of the hypothesis is then obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both Attention-based Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments suggest that our method consistently outperforms subword-level LMs, including N-gram LMs and neural network LMs. We achieve state-of-the-art results on both the Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and obtain a 14.8% relative CER reduction on a 21K-hour Mandarin dataset.
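The lattice idea can be illustrated with a minimal sketch. It is not the paper's implementation: the toy lexicon, the bigram table, the `unk_logp` fallback, and the function names `word_lattice` and `lattice_lm_score` are all assumptions made for illustration. The sketch enumerates every lexicon word that matches a span of the character hypothesis, then runs a forward pass that sums (in the log domain) the bigram LM scores of all word paths through that lattice. Summing over paths is one possible choice; taking the best path is another. Trailing characters that form only a partial word are not handled here.

```python
import math
from collections import defaultdict


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def word_lattice(chars, lexicon, max_len=4):
    """Return arcs (start, end, word) for every lexicon word matching a span."""
    arcs = []
    for i in range(len(chars)):
        for j in range(i + 1, min(len(chars), i + max_len) + 1):
            word = chars[i:j]
            if word in lexicon:
                arcs.append((i, j, word))
    return arcs


def lattice_lm_score(chars, lexicon, bigram_logp, unk_logp=-10.0):
    """Log-sum of bigram LM scores over all word segmentations of chars.

    bigram_logp maps (previous_word, word) to a log-probability; pairs that
    are missing fall back to unk_logp (a stand-in for real back-off).
    """
    arcs_from = defaultdict(list)
    for i, j, word in word_lattice(chars, lexicon):
        arcs_from[i].append((j, word))

    # alpha[pos][prev_word]: log-sum of scores of all paths that cover
    # chars[:pos] and end with prev_word (the bigram LM state).
    alpha = [defaultdict(lambda: -math.inf) for _ in range(len(chars) + 1)]
    alpha[0]["<s>"] = 0.0
    for i in range(len(chars)):
        for prev, score in list(alpha[i].items()):
            for j, word in arcs_from[i]:
                lp = bigram_logp.get((prev, word), unk_logp)
                alpha[j][word] = logaddexp(alpha[j][word], score + lp)

    total = -math.inf
    for score in alpha[len(chars)].values():
        total = logaddexp(total, score)
    return total


# Hypothetical toy data: one character string with two valid segmentations.
lexicon = {"南京", "南京市", "市长", "长江", "江", "大桥"}
bigram_logp = {("<s>", "南京市"): math.log(0.4), ("南京市", "长江"): math.log(0.5)}
score = lattice_lm_score("南京市长江大桥", lexicon, bigram_logp)
print(f"lattice LM log-score: {score:.3f}")
```

In a beam search, a score of this kind would be combined with the E2E model score for each partial hypothesis; how the weights are set and how incomplete trailing words are treated are left open in this sketch.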

Tibetan-Mandarin Bilingual Speech Recognition Based on End-to-end Framework

A hybrid CTC+Attention model based on end-to-end framework for multilingual speech recognition

Towards Language-Universal Mandarin-English Speech Recognition

End-to-end Code-switched TTS with Mix of Monolingual Recordings

A comparable study of modeling units for end-to-end Mandarin speech recognition

Investigation of Modeling Units for Mandarin Speech Recognition Using DFSMN-CTC-sMBR

Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

Investigation of Transformer Based Spelling Correction Model for CTC-based End-to-End Mandarin Speech Recognition

End-to-end keywords spotting based on connectionist temporal classification for Mandarin

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

Advances in Cantonese Speech Recognition: A Language-Specific Pretraining Model and RNN-T Loss

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Selection of acoustic modeling unit for Tibetan speech recognition based on deep learning

Enhancing CTC-based speech recognition with diverse modeling units

Exploring CTC Based End-to-End Techniques for Myanmar Speech Recognition

Mandarin-English bilingual phone modeling and combining MPE based Discriminative training for cross-language speech recognition

Improving Minority Language Speech Recognition Based on Distinctive Features

Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

Chinese Dialect Speech Recognition Based on End-to-end Machine Learning

A Convenient and Extensible Offline Chinese Speech Recognition System Based on Convolutional CTC Networks

Acoustic Modeling with DFSMN-CTC and Joint CTC-CE Learning