Abstract: Due to the dynamic nature of human language, automatic speech recognition (ASR) systems need to continuously acquire new vocabulary. Out-of-vocabulary (OOV) words, such as trending words and new named entities, pose problems for modern ASR systems, which require long training times to adapt their large numbers of parameters. Unlike most previous research, which focuses on language model post-processing, we tackle this problem at an earlier processing level and eliminate the bias in acoustic modeling so that OOV words are recognized acoustically. We propose to generate audio for OOV words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words. Specifically, when fine-tuning a previously trained model on synthetic audio, we either enlarge the classification loss of utterances containing OOV words (utterance-level) or rescale the gradient used in back-propagation for the OOV words themselves (word-level). To overcome catastrophic forgetting, we also explore combining loss rescaling with model regularization, i.e., L2 regularization and elastic weight consolidation (EWC). Compared with previous methods that simply fine-tune on synthetic audio with EWC, experimental results on the LibriSpeech benchmark show that our proposed loss rescaling approach achieves a significant improvement in recall rate with only a slight decrease in word error rate. Moreover, word-level rescaling is more stable than utterance-level rescaling and leads to higher recall and precision on OOV word recognition. Furthermore, our proposed combination of loss rescaling and weight consolidation can support continual learning of an ASR system.
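The difference between the two rescaling granularities, and the shape of an EWC-style penalty, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the per-token loss representation, the scale factor, and the `lam` weight are all assumptions made for illustration. Word-level rescaling is approximated here by scaling per-token loss terms, which scales the gradient contribution of those terms by the same factor.

```python
from typing import Sequence


def utterance_level_loss(token_losses: Sequence[float],
                         is_oov: Sequence[bool],
                         scale: float) -> float:
    """Enlarge the whole utterance loss if the utterance contains any OOV token.

    Hypothetical helper: `scale` is an assumed hyperparameter.
    """
    total = sum(token_losses)
    return total * scale if any(is_oov) else total


def word_level_loss(token_losses: Sequence[float],
                    is_oov: Sequence[bool],
                    scale: float) -> float:
    """Enlarge only the loss terms (and hence gradients) of OOV tokens."""
    return sum(loss * (scale if oov else 1.0)
               for loss, oov in zip(token_losses, is_oov))


def ewc_penalty(params: Sequence[float],
                old_params: Sequence[float],
                fisher: Sequence[float],
                lam: float) -> float:
    """Quadratic penalty that anchors parameters to their pre-fine-tuning values,
    weighted by a diagonal Fisher information estimate (assumed to be given)."""
    return 0.5 * lam * sum(f * (p - q) ** 2
                           for p, q, f in zip(params, old_params, fisher))


# Toy utterance with three tokens; the second one is an OOV word.
token_losses = [0.5, 2.0, 0.25]
is_oov = [False, True, False]

utt = utterance_level_loss(token_losses, is_oov, scale=2.0)   # all terms doubled
word = word_level_loss(token_losses, is_oov, scale=2.0)       # only OOV term doubled
reg = ewc_penalty([1.0, 2.0], [0.0, 2.0], [0.5, 3.0], lam=4.0)

print(utt, word, reg)
```

In this toy case the utterance-level variant also amplifies the loss of the two in-vocabulary tokens, whereas the word-level variant leaves them untouched, which is one intuition for why a finer granularity might behave more stably.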