Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced minimum Bayesian risk (MBR), one such discriminative criterion, into E2E ASR systems. However, the effectiveness and efficiency of MBR-based methods are compromised in two ways: the MBR criterion is used only in training, which creates a mismatch between training and decoding; and the on-the-fly decoding required during MBR training calls for pre-trained models and slows training. To address these issues, this work proposes novel algorithms that integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems in both the training stage and the decoding process. The proposed LF-MMI training and decoding methods are shown to be effective on two widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method maintains consistency between training and decoding, avoids on-the-fly decoding, and trains from randomly initialized models with superior training efficiency. Experiments suggest that the LF-MMI method outperforms its MBR counterparts and consistently yields statistically significant improvements across frameworks and on datasets ranging from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on the Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released.
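
For readers unfamiliar with the criterion named in the abstract, the MMI objective in its standard textbook form can be sketched as follows. The symbols are assumptions introduced here for illustration and are not taken from the abstract: O is the acoustic observation sequence, W the reference transcription, W' any competing hypothesis, and G_num and G_den the numerator and denominator graphs of the lattice-free formulation. The paper's own formulation for AEDs and NTs may differ in detail.

```latex
% Standard MMI criterion: log posterior of the reference transcription W
% given the acoustics O (symbols are illustrative assumptions).
\[
  \mathcal{J}_{\mathrm{MMI}}(O, W)
  = \log P(W \mid O)
  = \log \frac{p(O \mid W)\, P(W)}{\sum_{W'} p(O \mid W')\, P(W')}
\]
% Lattice-free variant: the sum over all hypotheses W' is replaced by a
% forward computation over a fixed denominator graph, so no lattices and
% no on-the-fly decoding are needed.
\[
  \mathcal{J}_{\mathrm{LF\text{-}MMI}}(O, W)
  \approx \log \frac{p(O \mid \mathbb{G}_{\mathrm{num}})}{p(O \mid \mathbb{G}_{\mathrm{den}})}
\]
```

Because the denominator is computed over a fixed graph rather than over hypotheses decoded during training, this form is consistent with the abstract's claim that the method avoids on-the-fly decoding.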

Maximum-a-Posteriori-Based Decoding for End-to-End Acoustic Models

Speech neuromuscular decoding based on spectrogram images using conformal predictors with Bi-LSTM.

A Deliberation-based Joint Acoustic and Text Decoder

Integrating Lattice-Free MMI into End-to-End Speech Recognition

An Asynchronous WFST-Based Decoder for Automatic Speech Recognition

Acoustic Model Fusion for End-to-end Speech Recognition

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Maxout Neurons Based Deep Bidirectional LSTM for Acoustic Modeling

Modular End-to-End Automatic Speech Recognition Framework for Acoustic-to-Word Model

Auxiliary Multimodal LSTM for Audio-visual Speech Recognition and Lipreading

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

On decoder-only architecture for speech-to-text and large language model integration

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Comparison of Decoding Strategies for CTC Acoustic Models

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Decoder-only Architecture for Streaming End-to-end Speech Recognition

A Thorough Examination of Decoding Methods in the Era of LLMs

An fMRI-based auditory decoding framework combined with convolutional neural network for predicting the semantics of real-life sounds from brain activity

AADNet: An End-to-End Deep Learning Model for Auditory Attention Decoding