Abstract: Large language models (LLMs) have shown superb capability of modeling multimodal signals including audio and text, allowing the model to generate spoken or textual response given a speech input. However, it remains a challenge for the model to recognize personal named entities, such as contacts in a phone book, when the input modality is speech. In this work, we start with a speech recognition task and propose a retrieval-based solution to contextualize the LLM: we first let the LLM detect named entities in speech without any context, then use this named entity as a query to retrieve phonetically similar named entities from a personal database and feed them to the LLM, and finally run context-aware LLM decoding. In a voice assistant task, our solution achieved up to 30.2% relative word error rate reduction and 73.6% relative named entity error rate reduction compared to a baseline system without contextualization. Notably, our solution by design avoids prompting the LLM with the full named entity database, making it highly efficient and applicable to large named entity databases.

What problem does this paper attempt to address?

This paper addresses how large language models (LLMs) can reliably recognize personal named entities, such as contact names, in automatic speech recognition (ASR). When the input modality is speech, an LLM without context struggles to transcribe such entities correctly. This matters for applications such as voice assistants, which must recognize names from the user's contact list.

To solve this problem, the authors propose a retrieval-based method for contextualizing the LLM. It has three steps:

1. **Named Entity Detection**: The LLM first detects named entities in the speech without any context.
2. **Phoneme-Based Retrieval**: The detected named entity is used as a query to retrieve phonetically similar named entities from a personal database.
3. **Context-Aware Generation**: The retrieved named entities are fed to the LLM, which then runs context-aware decoding.

In a voice assistant task, this method achieved up to 30.2% relative word error rate (WER) reduction and 73.6% relative named entity error rate reduction compared with a baseline system without contextualization. By design, the method never prompts the LLM with the full named entity database, which keeps it efficient and applicable to large named entity databases.
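The retrieval step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary does not say how phonetic similarity is measured, so this sketch assumes a crude spelling-based `phonetic_key` in place of a real grapheme-to-phoneme model and ranks candidates by normalized edit distance. The function names, the contact list, and the prompt format are all hypothetical.

```python
from typing import List, Tuple


def phonetic_key(name: str) -> str:
    """Crude stand-in for a phoneme sequence (assumption, not the paper's method)."""
    s = name.lower()
    # Collapse a few spelling variants that often sound alike in English.
    for src, dst in (("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s"), ("y", "i")):
        s = s.replace(src, dst)
    # Keep letters only, then collapse runs of the same letter ("ll" -> "l").
    s = "".join(ch for ch in s if ch.isalpha())
    out: List[str] = []
    for ch in s:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def retrieve(query: str, database: List[str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k entries whose phonetic keys are closest to the query's."""
    q = phonetic_key(query)
    scored = []
    for entry in database:
        e = phonetic_key(entry)
        # Normalize by the longer key so long and short names are comparable.
        dist = edit_distance(q, e) / max(len(q), len(e), 1)
        scored.append((entry, dist))
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


def build_prompt(candidates: List[str]) -> str:
    """Hypothetical prompt text for the context-aware decoding pass."""
    return "Possible contact names: " + ", ".join(candidates)


# Toy personal database and a pretend entity from the context-free first pass.
contacts = ["Kaitlyn Fischer", "Caitlin Fisher", "Bob Smith", "Katherine Fox"]
first_pass_entity = "Katelyn Fisher"
hits = retrieve(first_pass_entity, contacts, top_k=2)
prompt = build_prompt([name for name, _ in hits])
```

Only the `top_k` retrieved names reach the prompt, so the prompt length stays fixed however large the contact list grows. This is the property the summary credits for the method's efficiency.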

Contextualization of ASR with LLM using phonetic retrieval-based augmentation

Contextual Biasing of Named-Entities with Large Language Models

Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

CTC-Assisted LLM-Based Contextual ASR

End-to-End Speech Recognition Contextualization with Large Language Models

Using Large Language Model for End-to-End Chinese ASR and NER

Contextual Spelling Correction with Large Language Models

Enhancing Large Language Model-based Speech Recognition by Contextualization for Rare and Ambiguous Words

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models

ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling

Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

A Discriminative Entity-Aware Language Model for Virtual Assistants

Server-side Rescoring of Spoken Entity-centric Knowledge Queries for Virtual Assistants

Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring

Leveraging Large Language Models for Exploiting ASR Uncertainty

Prompting Large Language Models with Speech Recognition Abilities