Abstract: Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), incorporating external language models (LMs) into decoding has been shown to further improve the recognition performance of E2E ASR systems. To match the modeling units adopted in E2E ASR systems, subword-level (e.g., character, BPE) LMs are usually used alongside them. However, subword-level LMs ignore word-level information, which may limit the strength of the external LMs in E2E ASR. Although several methods have been proposed to incorporate word-level external LMs in E2E ASR, they are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which a character sequence can correspond to multiple word sequences. To this end, we propose a novel decoding algorithm in which a word-level lattice is constructed on-the-fly to consider all possible word sequences for each partial hypothesis. The LM score of the hypothesis is then obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both Attention-based Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments suggest that our method consistently outperforms subword-level LMs, including N-gram LMs and neural network LMs. We achieve state-of-the-art results on both the Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and obtain a 14.8% relative CER reduction on a 21K-hour Mandarin dataset.
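
The lattice idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lexicon, the bigram probabilities, the flat back-off score, and the function names (`build_lattice`, `best_word_lm_score`) are all hypothetical, chosen only for illustration. The sketch takes the best-scoring path through the lattice (a Viterbi choice, which is an assumption; the abstract does not say how paths are combined) and scores only complete character sequences, ignoring hypotheses that end in the middle of a word.

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon and bigram log-probabilities (illustrative values only).
LEXICON = {"北", "京", "北京", "大", "学", "大学", "北京大学"}
BIGRAM_LOGP = {
    ("<s>", "北京"): math.log(0.4),
    ("<s>", "北京大学"): math.log(0.3),
    ("北京", "大学"): math.log(0.5),
}
BACKOFF_LOGP = math.log(1e-4)  # flat score for unseen bigrams (a simplification)
MAX_WORD_LEN = 4               # longest word length considered when building arcs


def bigram_logp(prev, word):
    """Look up log P(word | prev), falling back to a flat score if unseen."""
    return BIGRAM_LOGP.get((prev, word), BACKOFF_LOGP)


def build_lattice(chars):
    """Return arcs (start, end, word) for every lexicon word equal to chars[start:end]."""
    arcs = []
    for start in range(len(chars)):
        last_end = min(len(chars), start + MAX_WORD_LEN)
        for end in range(start + 1, last_end + 1):
            word = chars[start:end]
            if word in LEXICON:
                arcs.append((start, end, word))
    return arcs


def best_word_lm_score(chars):
    """Best bigram log-probability over all word segmentations of chars.

    States are (position, previous word), which is what intersecting the
    lattice with a bigram LM amounts to in this toy setting.
    """
    arcs_from = defaultdict(list)
    for start, end, word in build_lattice(chars):
        arcs_from[start].append((end, word))

    best = {(0, "<s>"): 0.0}
    for pos in range(len(chars)):
        # Snapshot the states at this position before extending them.
        states = [(prev, score) for (p, prev), score in best.items() if p == pos]
        for prev, score in states:
            for end, word in arcs_from[pos]:
                candidate = score + bigram_logp(prev, word)
                if candidate > best.get((end, word), -math.inf):
                    best[(end, word)] = candidate

    finals = [score for (p, _), score in best.items() if p == len(chars)]
    return max(finals) if finals else -math.inf


if __name__ == "__main__":
    print(build_lattice("北京大学"))
    print(best_word_lm_score("北京大学"))
```

In this toy setup the same four characters can be read as one word, as two two-character words, or as four single characters, and the dynamic program compares all of these paths under the word-level LM instead of scoring the characters one by one.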

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Effective internal language model training and fusion for factorized transducer model