Abstract: In current spoken language understanding (SLU) systems, the automatic speech recognition (ASR) module produces multiple interpretations (or hypotheses) of the input audio signal, and the natural language understanding (NLU) module takes the one with the highest confidence score for domain or intent classification. However, the interpretations can be noisy, and relying solely on one interpretation can cause information loss. To address this problem, many research works rerank the interpretations to make a better choice, while some recent works achieve better performance by integrating all the hypotheses during prediction. In this paper, we also integrate the hypotheses, but we strengthen training by involving more tasks via multi-task learning or transfer learning; some of these tasks may not be among the existing NLU tasks but are relevant to them. Moreover, we propose the Hierarchical Attention Mechanism (HAM) to further improve performance using acoustic-model features such as confidence scores, which current hypothesis-integration models ignore. The experimental results show that, compared with standard estimation from one hypothesis, multi-task learning with HAM improves domain and intent classification by a relative 19% and 37%, respectively, which is much higher than the improvements obtained with current integration or reranking methods. To illustrate the cause of the improvements brought by our model, we decode the hidden representations of some example utterances and compare the generated texts with the hypotheses and transcripts. The comparison shows that our model can recover the transcription by integrating the fragmented information spread across hypotheses and by identifying frequent error patterns of the ASR module, and can even rewrite the query for better understanding, which reveals the knowledge-broadcasting characteristic of multi-task learning.
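As a rough illustration of the idea of combining N-best hypotheses with confidence scores, the following is a minimal sketch of a two-level attention pooling step. It is not the HAM described in the paper: the abstract does not specify the architecture, so the function names (`attend`, `fuse_nbest`), the dot-product scoring, the query vectors, and the use of log-confidence as an additive bias are all assumptions made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(vectors, weights):
    # Convex combination of equal-length vectors.
    dim = len(vectors[0])
    return [sum(w * vec[i] for vec, w in zip(vectors, weights)) for i in range(dim)]

def attend(vectors, query, bias=None):
    # Dot-product attention; `bias` (hypothetical) is added to the logits.
    logits = [dot(vec, query) for vec in vectors]
    if bias is not None:
        logits = [l + b for l, b in zip(logits, bias)]
    weights = softmax(logits)
    return weighted_sum(vectors, weights), weights

def fuse_nbest(hypotheses, confidences, word_query, hyp_query):
    # Level 1 (assumed): pool the token embeddings inside each hypothesis.
    hyp_vectors = [attend(tokens, word_query)[0] for tokens in hypotheses]
    # Level 2 (assumed): pool across hypotheses, biased by ASR log-confidence.
    conf_bias = [math.log(c) for c in confidences]
    return attend(hyp_vectors, hyp_query, bias=conf_bias)

# Toy input: 3 hypotheses, 2 tokens each, 2-dimensional embeddings.
hypotheses = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
confidences = [0.6, 0.3, 0.1]
fused, weights = fuse_nbest(hypotheses, confidences, [0.0, 0.0], [0.0, 0.0])
print(fused, weights)
```

With all-zero query vectors, as in the toy call, the hypothesis-level weights reduce to the normalized confidence scores; learned queries would shift weight toward hypotheses whose content is more useful for the task. The fused vector would then feed a domain or intent classifier.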

ASR N-BEST FUSION NETS

Improving Spoken Language Understanding By Exploiting ASR N-best Hypotheses

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

Acoustic Model Fusion for End-to-end Speech Recognition

Integrating Multiple ASR Systems into NLP Backend with Attention Fusion

ASR-Robust Spoken Language Understanding on ASR-GLUE dataset

Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

Natural Language Inference Using LSTM Model With Sentence Fusion

Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks

ASR-GLUE: A New Multi-task Benchmark for ASR-Robust Natural Language Understanding

Customization of the ASR System for ATC Speech with Improved Fusion

Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Robust Spoken Language Understanding With Unsupervised ASR-Error Adaptation

Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition

It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition

CATNet: Cross-modal fusion for audio-visual speech recognition

Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention

Spectrograms Fusion-based End-to-end Robust Automatic Speech Recognition

Monolingual Recognizers Fusion for Code-switching Speech Recognition

Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding