Abstract: Attention-based contextual biasing approaches have shown significant improvements in the recognition of generic and/or personal rare words in End-to-End Automatic Speech Recognition (E2E ASR) systems like neural transducers. These approaches employ cross-attention to bias the model towards specific contextual entities injected into the model as bias phrases. Prior approaches typically relied on subword encoders for encoding the bias phrases. However, subword tokenizations are coarse and fail to capture granular pronunciation information, which is crucial for biasing based on acoustic similarity. In this work, we propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing guided by acoustic similarity between the audio and the contextual entities (termed acoustic biasing). We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context along with contextual entities to perform biasing informed by the utterance's semantic context (termed semantic biasing). Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% to 9.26% relative WER improvement across different biasing list sizes over the baseline contextual model when incorporating our proposed acoustic and semantic biasing approach. On a large-scale in-house dataset, we observe a 7.91% relative WER improvement compared to our baseline model. On tail utterances, the improvements are even more pronounced, with 36.80% and 23.40% relative WER improvements on Librispeech rare words and an in-house test set, respectively.
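
To make the cross-attention biasing idea in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy, non-learned character encoder (`char_encode`, a hypothetical helper that averages fixed per-character vectors) in place of the learned character representations, and it treats a single vector as the audio-encoder query. All function names, the embedding dimension, and the example phrases are illustrative assumptions.

```python
import math

DIM = 8  # illustrative embedding size (assumption, not from the paper)


def char_encode(phrase, dim=DIM):
    """Toy character-level encoder: mean of fixed per-character vectors.

    Hypothetical stand-in for a learned character encoder; it only shows
    that phrases with similar spellings can map to nearby vectors.
    """
    vec = [0.0] * dim
    for ch in phrase.lower():
        code = ord(ch)
        for i in range(dim):
            vec[i] += math.sin(code * (i + 1))
    n = max(len(phrase), 1)
    return [v / n for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_bias(query, phrase_embeddings):
    """Scaled dot-product attention of one query over the bias phrases.

    Returns the attention weights and the weighted sum of phrase
    embeddings (the bias vector).
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in phrase_embeddings
    ]
    weights = softmax(scores)
    bias = [
        sum(w * emb[i] for w, emb in zip(weights, phrase_embeddings))
        for i in range(len(query))
    ]
    return weights, bias


# Illustrative bias list and a stand-in "audio" query (assumptions).
bias_phrases = ["kaitlyn", "caitlin", "zebra"]
embeddings = [char_encode(p) for p in bias_phrases]
query = char_encode("kaitlin")  # pretend the audio encoder produced this

weights, bias = cross_attention_bias(query, embeddings)
# Residual combination: the biased representation fed to later layers.
biased = [q + b for q, b in zip(query, bias)]
```

In this sketch the bias vector is simply added to the query; how the biased representation is combined with the encoder or predictor outputs in the actual model is not specified in the abstract.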

N-gram Boosting: Improving Contextual Biasing with Normalized N-gram Targets

Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition

Spell my name: keyword boosted speech recognition

InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions

Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation

Text Injection for Neural Contextual Biasing

Optimizing Contextual Speech Recognition Using Vector Quantization for Efficient Retrieval

Boosting Tail Neural Network for Realtime Custom Keyword Spotting

Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition

Data Boost: Text Data Augmentation Through Reinforcement Learning Guided Conditional Generation

LM-assisted keyword biasing with Aho-Corasick algorithm for Transducer-based ASR

Discriminative Boosting Algorithm for Diversified Front-End Phonotactic Language Recognition

XFBoost: Improving Text Generation with Controllable Decoders

Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model

Keyword-Guided Adaptation of Automatic Speech Recognition

Minimising Biasing Word Errors for Contextual ASR with the Tree-Constrained Pointer Generator

Empirically Combining Unnormalized NNLM and Back-off N-Gram for Fast N-Best Rescoring in Speech Recognition

Identifying Language Origin Of Person Names With N-Grams Of Different Units

Discriminative Boosting Regression Backend for Phonotactic Language Recognition

Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss