Abstract: In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors can degrade subsequent SLU tasks. Here we introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts with the help of word confusion networks from lattices, bridging the SLU performance gap between using the top ASR hypothesis and an oracle upper bound. Additionally, we delve into the LLM's robustness to varying ASR performance conditions and scrutinize the aspects of in-context learning which prove the most influential.

What problem does this paper attempt to address?

This paper addresses the degradation of downstream spoken language understanding (SLU) performance caused by errors in the output of the automatic speech recognition (ASR) system. Specifically:

1. **Impact of ASR Output Errors**: In a traditional SLU system, the ASR system first transcribes the voice input into a single text hypothesis (the "1-best"), which is then passed to the natural language understanding (NLU) pipeline. This 1-best hypothesis often contains errors, which can significantly hurt downstream task performance.
2. **Limitations of Existing Methods**: Previous studies have tried to alleviate this problem by passing along additional information from the ASR system (such as n-best hypotheses or lattices), but these methods usually rely on specific models or require complex pre-processing steps, and their effectiveness in practical applications is limited.
3. **Introduction of Word Confusion Networks (WCNs)**: The paper proposes using the lattice output of the ASR system instead of relying only on the top hypothesis. WCNs derived from the lattices are represented as strings and fed to large language models (LLMs), with the aim of capturing the ambiguity of speech and improving SLU performance.
4. **Objectives**: The paper sets out to:
   - Study whether LLMs become more robust to ASR errors and ambiguity when in-context learning is combined with WCN representations.
   - Explore how effective the method is across different LLM sizes.
   - Analyze how WCNs perform under different ASR performance conditions, and which aspects of in-context learning are most influential.

Through these studies, the authors hope to improve how LLMs handle noisy speech transcripts without any model fine-tuning, purely by improving the input representation.
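To make the idea of "representing WCNs as strings" concrete, here is a minimal Python sketch. It is not the paper's actual format: the slot separator `/`, the probability threshold, the `<eps>` empty-word marker, the prompt template, and all function names (`one_best`, `wcn_to_string`, `build_prompt`) are assumptions chosen for illustration. Taking the most probable word per slot is also only an approximation of the ASR system's top hypothesis.

```python
from typing import List, Tuple

# Hypothetical WCN structure: a sequence of slots, each holding competing
# word hypotheses with posterior probabilities.
Slot = List[Tuple[str, float]]
EPS = "<eps>"  # assumed marker for "no word in this slot"


def one_best(wcn: List[Slot]) -> str:
    """Take the most probable word in every slot (approximate 1-best)."""
    words = [max(slot, key=lambda wp: wp[1])[0] for slot in wcn]
    return " ".join(w for w in words if w != EPS)


def wcn_to_string(wcn: List[Slot], threshold: float = 0.2, sep: str = "/") -> str:
    """Serialize a WCN, keeping every alternative at or above the threshold."""
    parts = []
    for slot in wcn:
        ranked = sorted(slot, key=lambda wp: -wp[1])
        kept = [w for w, p in ranked if p >= threshold and w != EPS]
        if kept:
            parts.append(sep.join(kept))
    return " ".join(parts)


def build_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Assemble a few-shot intent-classification prompt (assumed template)."""
    lines = ["Alternatives separated by '/' are competing ASR hypotheses.", ""]
    for transcript, intent in demos:
        lines += [f"Transcript: {transcript}", f"Intent: {intent}", ""]
    lines += [f"Transcript: {query}", "Intent:"]
    return "\n".join(lines)


# Made-up example data, not taken from the paper.
example_wcn: List[Slot] = [
    [("book", 0.9), ("look", 0.1)],
    [("a", 0.6), ("the", 0.4)],
    [("flight", 0.55), ("fight", 0.45)],
    [("to", 1.0)],
    [("boston", 0.7), ("austin", 0.3)],
]
example_demos = [("play/pay some jazz", "play_music")]
example_prompt = build_prompt(example_demos, wcn_to_string(example_wcn))
```

With this data, `one_best(example_wcn)` gives `book a flight to boston`, while `wcn_to_string(example_wcn)` gives `book a/the flight/fight to boston/austin`, so the LLM sees the competing words that the 1-best transcript discards. Because the change is confined to the input string, no fine-tuning is involved.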

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks
