Abstract: In current spoken language understanding (SLU) systems, the automatic speech recognition (ASR) module produces multiple interpretations (or hypotheses) of the input audio signal, and the natural language understanding (NLU) module takes the one with the highest confidence score for domain or intent classification. However, the interpretations can be noisy, and relying solely on one interpretation can cause information loss. To address this problem, many research works attempt to rerank the interpretations to make a better choice, while some recent works achieve better performance by integrating all the hypotheses during prediction. In this paper, we also integrate the hypotheses, but we strengthen training by involving more tasks via multi-task learning or transfer learning; some of these tasks may not be among the existing NLU tasks but are relevant to them. Moreover, we propose the Hierarchical Attention Mechanism (HAM) to further improve performance using acoustic-model features such as confidence scores, which current hypothesis integration models ignore. The experimental results show that, compared to standard estimation with one hypothesis, multi-task learning with HAM improves domain and intent classification by a relative 19% and 37%, respectively, which is much higher than the improvements obtained with current integration or reranking methods. To illustrate the cause of the improvements brought by our model, we decode the hidden representations of some example utterances and compare the generated texts with the hypotheses and transcripts. The comparison shows that our model can recover the transcription by integrating the fragmented information among hypotheses and identifying the frequent error patterns of the ASR module, and can even rewrite the query for better understanding, which reveals the knowledge-broadcasting characteristic of multi-task learning.

Integrating Multiple ASR Systems into NLP Backend with Attention Fusion

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Attention-based Multi-hypothesis Fusion for Speech Summarization

ASR N-BEST FUSION NETS

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

On Efficient Coupling of ASR and SMT for Speech Translation

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models

MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition

Large-vocabulary Audio-visual Speech Recognition in Noisy Environments

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

Combining Hybrid DNN-HMM ASR Systems with Attention-Based Models Using Lattice Rescoring

Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Non-Autoregressive End-To-End Automatic Speech Recognition Incorporating Downstream Natural Language Processing

EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed

Improving Multi-Speaker ASR With Overlap-Aware Encoding And Monotonic Attention

Transfer learning of language-independent end-to-end ASR with language model fusion

Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention

Spectrograms Fusion-based End-to-end Robust Automatic Speech Recognition

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR