Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke
2024-01-06
Abstract: In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automated speech recognition (ASR) system generates an output transcript hypothesis, where inherent errors can degrade subsequent SLU tasks. Here we introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts with the help of word confusion networks from lattices, bridging the SLU performance gap between using the top ASR hypothesis and an oracle upper bound. Additionally, we delve into the LLM's robustness to varying ASR performance conditions and scrutinize the aspects of in-context learning which prove the most influential.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the performance degradation that Automatic Speech Recognition (ASR) output errors cause in downstream Spoken Language Understanding (SLU) tasks. Specifically:

1. **Impact of ASR Output Errors**: In traditional SLU systems, the ASR system first transcribes the voice input into a single text hypothesis (the "1-best"), which is then passed to the Natural Language Understanding (NLU) pipeline. This 1-best hypothesis often contains errors, which can significantly affect the performance of downstream tasks.
2. **Limitations of Existing Methods**: Previous studies have tried to alleviate this problem by passing along additional information from the ASR system, such as n-best hypotheses or lattices. These methods usually rely on specific models or require complex pre-processing steps, and their effectiveness in practical applications is limited.
3. **Introduction of Word Confusion Networks (WCNs)**: The paper proposes using the ASR system's lattice output instead of relying only on the top hypothesis. WCNs derived from the lattices are represented as strings and supplied to large language models (LLMs), with the aim of capturing the ambiguity of speech and improving performance on SLU tasks.
4. **Objectives**: The paper sets out to:
   - Study whether LLMs become more robust to ASR errors and ambiguity when in-context learning is combined with a WCN representation.
   - Explore how effective the method is across different LLM sizes.
   - Analyze how WCNs perform under different ASR performance conditions, and which aspects of in-context learning are the most influential.

Through these studies, the authors aim to improve how LLMs handle noisy speech transcripts without any model fine-tuning, solely by improving the input representation.
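
To make the idea of "representing WCNs as strings" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a WCN is a sequence of slots, each holding competing words with posterior probabilities. The slash separator, the `min_posterior` pruning threshold, the prompt wording, and the names `wcn_to_string`, `one_best`, and `build_prompt` are all illustrative assumptions; the paper's actual serialization format and prompts may differ.

```python
from typing import List, Tuple

# One WCN slot: competing word hypotheses with their posterior probabilities.
Slot = List[Tuple[str, float]]


def wcn_to_string(wcn: List[Slot], min_posterior: float = 0.2,
                  alt_sep: str = "/") -> str:
    """Flatten a WCN into one line of text (hypothetical format).

    Within each slot, the top word is always kept; lower-ranked words are
    kept only if their posterior is at least `min_posterior`.
    """
    tokens = []
    for slot in wcn:
        if not slot:
            continue  # skip empty slots
        ranked = sorted(slot, key=lambda wp: wp[1], reverse=True)
        kept = [ranked[0][0]]
        kept += [word for word, post in ranked[1:] if post >= min_posterior]
        tokens.append(alt_sep.join(kept))
    return " ".join(tokens)


def one_best(wcn: List[Slot]) -> str:
    """Baseline: keep only the highest-posterior word in each slot."""
    return " ".join(max(slot, key=lambda wp: wp[1])[0] for slot in wcn if slot)


def build_prompt(wcn_text: str, task_instruction: str) -> str:
    """Wrap the serialized WCN in an instruction (illustrative wording only)."""
    return (
        f"{task_instruction}\n"
        "The transcript comes from a speech recognizer. Words separated by "
        "'/' are alternatives for the same position.\n"
        f"Transcript: {wcn_text}\n"
        "Answer:"
    )


# Toy WCN for an utterance that was probably "play some jazz".
toy_wcn = [
    [("play", 0.92), ("pay", 0.08)],
    [("some", 0.55), ("sum", 0.40), ("son", 0.05)],
    [("jazz", 1.0)],
]

if __name__ == "__main__":
    serialized = wcn_to_string(toy_wcn)
    print(one_best(toy_wcn))
    print(serialized)
    print(build_prompt(serialized, "Classify the intent of the utterance."))
```

With the toy input, the 1-best baseline discards the competing word "sum", while the serialized WCN keeps it visible to the LLM. Raising or lowering `min_posterior` trades prompt length against how much of the ASR ambiguity is preserved.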