Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

Anderson R. Avila, Mehdi Rezagholizadeh, Chao Xing
2023-06-13
Abstract: Recent voice assistants are usually based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine and a natural language understanding (NLU) system. Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate the impact of this ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs), such as BERT and RoBERTa. Moreover, a multimodal language understanding (MLU) module is proposed to mitigate the SLU performance degradation caused by errors in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically Wav2Vec for speech and BERT/RoBERTa for language. Our MLU combines an encoder network that embeds the audio signal with a text encoder that processes text transcripts, followed by a late fusion layer that fuses the audio and text logits. We found that the proposed MLU is robust to poor-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely compromised. Our model is evaluated on five tasks from three SLU datasets, and robustness is tested using ASR transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the performance of the PLMs across all datasets for the academic ASR engine.
Computation and Language, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the effect of automatic speech recognition (ASR) error propagation on natural language understanding (NLU) performance in conventional spoken language understanding systems. Current voice assistants are usually built on cascaded spoken language understanding (SLU) solutions, in which an ASR engine feeds its transcript to an NLU system. Because the NLU system sees only the ASR output, any recognition error is passed downstream, and this so-called ASR error propagation can seriously degrade NLU performance. To alleviate the problem, the paper proposes a multimodal language understanding (MLU) module that combines self-supervised features learned from the audio and text modalities, using Wav2Vec for speech and BERT or RoBERTa for language. The MLU module embeds the audio signal with an encoder network, processes the text transcript with a text encoder, and then fuses the audio and text logits in a late-fusion layer. Experimental results show that the proposed MLU model remains robust on low-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely impaired. The model was evaluated on five tasks from three SLU datasets, with robustness tested on transcripts from three ASR engines. The results show that the method effectively alleviates the ASR error propagation problem and outperforms the pre-trained language models (PLMs) across all datasets on transcripts generated by the academic ASR engine.
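The late-fusion step described above can be illustrated with a minimal sketch. It assumes that each branch has already produced one logit per class and that fusion is a fixed weighted average of the two logit vectors; the summary does not specify the paper's actual fusion layer, which may well be learned. The function names, the `audio_weight` parameter, the intent labels, and the logit values are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(audio_logits, text_logits, audio_weight=0.5):
    """Fuse per-class logits from the two branches by a weighted average.

    Assumption: a fixed scalar weight stands in for the fusion layer;
    the paper's layer is not specified here and may be trainable.
    """
    if len(audio_logits) != len(text_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    text_weight = 1.0 - audio_weight
    return [audio_weight * a + text_weight * t
            for a, t in zip(audio_logits, text_logits)]


def predict(audio_logits, text_logits, labels, audio_weight=0.5):
    """Return the most probable label and the fused class probabilities."""
    probs = softmax(late_fusion(audio_logits, text_logits, audio_weight))
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical intent labels and logits. The text branch is imagined to
# have been misled by a transcription error, while the audio branch,
# which never sees the transcript, still favours the first intent.
labels = ["play_music", "set_alarm", "get_weather"]
audio_logits = [2.0, 0.1, -1.0]
text_logits = [0.2, 1.0, -0.5]

text_only_label, _ = predict(audio_logits, text_logits, labels, audio_weight=0.0)
fused_label, fused_probs = predict(audio_logits, text_logits, labels)
```

With `audio_weight=0.0` the sketch reduces to the text-only prediction, which makes it easy to compare a cascade-style decision with the fused one on the same inputs. Fusing at the logit level keeps the two encoders independent, so either branch could be swapped out without retraining the other under this simplified scheme.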