Abstract: Recent voice assistants are usually based on a cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine and a natural language understanding (NLU) system. Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, a multimodal language understanding (MLU) module is proposed to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an audio encoder network that embeds the audio signal with a text encoder that processes text transcripts, followed by a late fusion layer that fuses the audio and text logits. We found that the proposed MLU is robust to poor-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely compromised. Our model is evaluated on five tasks from three SLU datasets, and robustness is tested using ASR transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLM models' performance across all datasets for the academic ASR engine.
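The late fusion step named in the abstract can be illustrated with a small sketch. The abstract does not specify how the fusion layer combines the two sets of logits, so the code below assumes a fixed convex combination of per-class logits. The function names, the `audio_weight` parameter, and the example logits are all hypothetical; a trained system would more likely learn its fusion parameters.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(audio_logits: List[float],
                text_logits: List[float],
                audio_weight: float = 0.5) -> List[float]:
    """Fuse per-class logits from two modalities with a convex combination.

    ASSUMPTION: `audio_weight` is a hypothetical fixed mixing coefficient,
    not something taken from the paper.
    """
    if len(audio_logits) != len(text_logits):
        raise ValueError("both modalities must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return [audio_weight * a + (1.0 - audio_weight) * t
            for a, t in zip(audio_logits, text_logits)]


def predict(audio_logits: List[float],
            text_logits: List[float],
            audio_weight: float = 0.5) -> int:
    """Return the index of the most probable class after fusion."""
    probs = softmax(late_fusion(audio_logits, text_logits, audio_weight))
    return max(range(len(probs)), key=probs.__getitem__)


# Made-up logits for three intent classes. The text branch is misled
# (for instance by a faulty transcript) and prefers class 1, while the
# audio branch prefers class 0.
audio = [2.0, 0.5, 0.1]
text = [0.2, 1.0, 0.3]

fused_choice = predict(audio, text)                        # both modalities
text_only_choice = predict(audio, text, audio_weight=0.0)  # text logits only
```

With equal weights the fused prediction follows the audio branch, whereas setting `audio_weight` to zero reproduces the text-only decision. This shows on a toy scale how an audio branch can offset a text branch that has been degraded by transcript errors.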

Multi-Classification Model for Spoken Language Understanding

Hierarchical and Bidirectional Joint Multi-Task Classifiers for Natural Language Understanding

Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning

End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining

Semi-Supervised Spoken Language Understanding Via Self-Supervised Speech and Language Model Pretraining

End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders

A BiRGAT Model for Multi-intent Spoken Language Understanding with Hierarchical Semantic Frames

On joint training with interfaces for spoken language understanding

Using Bidirectional Transformer-CRF for Spoken Language Understanding

Towards Multi-Intent Spoken Language Understanding Via Hierarchical Attention and Optimal Transport

Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding

Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

FC-MTLF: A Fine- and Coarse-grained Multi-Task Learning Framework for Cross-Lingual Spoken Language Understanding

Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-trained DNN-HMM-Based Acoustic-Phonetic Model

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

OpenSLU: A Unified, Modularized, and Extensible Toolkit for Spoken Language Understanding

Understanding Semantics from Speech Through Pre-training

MCLF: A Multi-grained Contrastive Learning Framework for ASR-robust Spoken Language Understanding

A Joint and Domain-Adaptive Approach to Spoken Language Understanding

A Self-Attention Joint Model for Spoken Language Understanding in Situational Dialog Applications

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages