Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, adopting discriminative criteria is a promising way to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR) criterion, one of the discriminative criteria, into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised in two ways: the MBR criterion is used only in system training, which creates a mismatch between training and decoding; and the on-the-fly decoding process in MBR-based methods requires pre-trained models and slows training. To address these issues, this work proposes novel algorithms that integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems in both the training stage and the decoding process. The proposed LF-MMI training and decoding methods are shown to be effective on two widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method maintains consistency between training and decoding, eschews the on-the-fly decoding process, and trains from randomly initialized models with superior training efficiency. Experiments suggest that the LF-MMI method outperforms its MBR counterparts and consistently leads to statistically significant performance improvements on various frameworks and on datasets ranging from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on the Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released.

Discriminative Speaker Adaptation with Eigenvoices

Agmma: A Novel Incremental Adaptation Method And Its Application To Speaker Recognition

Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation

Improving Online Incremental Speaker Adaptation with Eigen Feature Space MLLR

Eigenvoice-based MAP Adaptation Within Correlation Subspace

HMM training method based on evolutionary computation and MDI in speech recognition

Speaker adaptation using maximum likelihood model interpolation

Scores Selection for Emotional Speaker Recognition

A Speaker Adaptation Algorithm Based on Matrix Linear Interpolation

Eigenvoice-based MAP Fast Adaptation in Correlation Subspaces

A New Subspace Based Speaker Adaptation Method

Rapid Speaker Adaptation Using Multi-Stream Structural Maximum Likelihood Eigenspace Mapping

MAP-based Speaker Adaptation in Speech Synthesis

Model Adaptation for HMM-Based Speech Synthesis under Minimum Generation Error Criterion

Dynamic Speaker Selected Training for Rapid Speaker Adaptation

Eigenspace Estimation With Missing Values And Its Application To Eigenvoice Adaptation For Speech Recognition

Latent Correlation Analysis of HMM Parameters for Speech Recognition

Integrating Lattice-Free MMI into End-to-End Speech Recognition

Speaker Adaptation of Hybrid NN/HMM Model for Speech Recognition Based on Singular Value Decomposition

Comparison of adaptation methods for GMM-SVM based speech emotion recognition

Phoneme Dependent Speaker Embedding And Model Factorization For Multi-Speaker Speech Synthesis And Adaptation