Abstract: Recent voice assistants are usually based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine and a natural language understanding (NLU) system. Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of this ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, a multimodal language understanding (MLU) module is proposed to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an encoder network to embed the audio signal and a text encoder to process text transcripts, followed by a late fusion layer to fuse audio and text logits. We found that the proposed MLU is robust to poor-quality ASR transcripts, while the performance of BERT and RoBERTa is severely compromised. Our model is evaluated on five tasks from three SLU datasets, and robustness is tested using ASR transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLM models' performance across all datasets for the academic ASR engine.

What problem does this paper attempt to address?

This paper addresses the effect of automatic speech recognition (ASR) error propagation on natural language understanding (NLU) performance in traditional speech understanding systems. Current voice assistants are usually built on cascaded spoken language understanding (SLU) solutions, which consist of an ASR engine followed by an NLU system. Because the NLU system depends on the ASR output, recognition errors carry over into it (the so-called ASR error propagation problem) and can seriously degrade its performance. To alleviate this problem, the paper proposes a multimodal language understanding (MLU) module that combines self-supervised features learned from the audio and text modalities, using Wav2Vec for speech and BERT or RoBERTa for language. The MLU module embeds the audio signal with an encoder network, processes the text transcript with a text encoder, and then fuses the audio and text logits in a late fusion layer. Experimental results show that the proposed MLU model is robust to low-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely impaired. The model was evaluated on five tasks from three SLU datasets, with robustness tested on transcripts from three ASR engines. The results show that the method effectively alleviates the ASR error propagation problem and outperforms the pre-trained language models (PLMs) on every dataset for the academic ASR engine.
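The late fusion step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the summary does not say how the fusion layer combines the two branches, so the fixed convex combination below, the function names (`softmax`, `late_fusion`, `predict`), the `audio_weight` parameter, and the example logits are all assumptions made for illustration. In a real system the logits would come from a Wav2Vec-based audio branch and a BERT/RoBERTa-based text branch, and the fusion weights would most likely be learned.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(audio_logits: List[float],
                text_logits: List[float],
                audio_weight: float = 0.5) -> List[float]:
    """Fuse per-class logits from two branches.

    Assumption: a fixed convex combination stands in for the fusion
    layer, whose exact form is not given in the summary.
    """
    if len(audio_logits) != len(text_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return [audio_weight * a + (1.0 - audio_weight) * t
            for a, t in zip(audio_logits, text_logits)]


def predict(audio_logits: List[float],
            text_logits: List[float],
            audio_weight: float = 0.5) -> int:
    """Return the index of the most probable class after fusion."""
    probs = softmax(late_fusion(audio_logits, text_logits, audio_weight))
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits for three intent classes. The text branch is
# misled by a noisy transcript and prefers class 0; the audio branch
# prefers class 1.
audio_logits = [0.0, 3.0, 0.5]
text_logits = [2.0, 1.0, 0.0]

fused_choice = predict(audio_logits, text_logits)
text_only_choice = predict(audio_logits, text_logits, audio_weight=0.0)
```

With equal weights the fused prediction follows the audio branch (class 1), while setting `audio_weight=0.0` reduces the model to the text-only prediction (class 0). This mirrors, in toy form, how an audio branch can offset a text branch that has been misled by a faulty transcript.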

Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding

Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model

On joint training with interfaces for spoken language understanding

Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining

Tie Your Embeddings Down: Cross-Modal Latent Spaces for End-to-end Spoken Language Understanding

MCLF: A Multi-grained Contrastive Learning Framework for ASR-robust Spoken Language Understanding

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Multimodal Speech Recognition for Language-Guided Embodied Agents

Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

Understanding Semantics from Speech Through Pre-training

Leveraging Large Language Models for Exploiting ASR Uncertainty

Bridging Speech and Textual Pre-trained Models with Unsupervised ASR

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models