Abstract: Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), incorporating external language models (LMs) into decoding has been shown to further improve the recognition performance of E2E ASR systems. To match the modeling units adopted in E2E ASR systems, subword-level (e.g., character or BPE) LMs are usually used alongside them. However, subword-level LMs ignore word-level information, which may limit the benefit of the external LM in E2E ASR. Although several methods have been proposed to incorporate word-level external LMs in E2E ASR, they are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which a character sequence can correspond to multiple word sequences. To this end, we propose a novel decoding algorithm in which a word-level lattice is constructed on-the-fly to consider all possible word sequences for each partial hypothesis. The LM score of the hypothesis is then obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both Attention-based Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments suggest that our method consistently outperforms subword-level LMs, including N-gram LMs and neural network LMs. We achieve state-of-the-art results on both the Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and obtain a 14.8% relative CER reduction on a 21K-hour Mandarin dataset.
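
The lattice idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names (`build_lattice`, `lattice_lm_score`), the `max_word_len` limit, the bigram history, and the choice to keep the best-scoring (Viterbi) path rather than sum over paths are all assumptions made for illustration. The sketch also scores a complete character string, whereas the proposed decoder works on partial hypotheses during beam search.

```python
import math
from collections import defaultdict

def build_lattice(chars, vocab, max_word_len=4):
    """Return arcs (start, end, word) for every vocabulary word found in chars."""
    arcs = []
    for start in range(len(chars)):
        last_end = min(len(chars), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = chars[start:end]
            if word in vocab:
                arcs.append((start, end, word))
    return arcs

def lattice_lm_score(chars, vocab, bigram_logprob, max_word_len=4):
    """Best word-bigram log-probability over all segmentations of chars.

    bigram_logprob(prev_word, word) is a hypothetical callable standing in
    for a lookup in an external word N-gram LM.
    Returns -inf when no segmentation covers the whole string.
    """
    arcs_from = defaultdict(list)
    for start, end, word in build_lattice(chars, vocab, max_word_len):
        arcs_from[start].append((end, word))

    # State = (character position, previous word); value = best log-prob so far.
    best = {(0, "<s>"): 0.0}
    # Arcs only move forward, so states at `pos` are final once we reach it.
    for pos in range(len(chars)):
        for (state_pos, prev), score in list(best.items()):
            if state_pos != pos:
                continue
            for end, word in arcs_from[pos]:
                cand = score + bigram_logprob(prev, word)
                if cand > best.get((end, word), -math.inf):
                    best[(end, word)] = cand

    finals = [s for (p, _), s in best.items() if p == len(chars)]
    return max(finals) if finals else -math.inf

if __name__ == "__main__":
    vocab = {"南京", "南京市", "市长", "长江", "大桥", "江"}
    # Toy LM (assumption): every word costs the same, so fewer words score higher.
    uniform = lambda prev, word: -1.0
    print(build_lattice("南京市长江大桥", vocab))
    print(lattice_lm_score("南京市长江大桥", vocab, uniform))
```

The example string admits two segmentations under this vocabulary (南京市/长江/大桥 and 南京/市长/江/大桥), which is the ambiguity that makes a single fixed word sequence insufficient for Mandarin. A real system would replace the toy scorer with an actual N-gram LM and would also have to handle an incomplete trailing word in a partial hypothesis.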

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model