A Dual RNN Semantic Analysis Framework for Intent Classification and Slot Filling
Hua Xu, Hanlei Zhang, Ting-En Lin
DOI: https://doi.org/10.1007/978-981-99-3885-8_4
2023-01-01
Abstract: Research on spoken language understanding (SLU) systems has progressed rapidly over the past decades. Intent detection and slot filling are the two main tasks in building an SLU system. Multiple deep-learning-based models have demonstrated good results on these tasks. The most effective algorithms are based on sequence-to-sequence (or “encoder-decoder”) models, and generate the intents and semantic tags either with separate models (Yao, et al. Spoken language understanding using long short-term memory neural networks. South Lake Tahoe. 189–194, 2014; Mesnil, et al. IEEE/ACM Trans Audio Speech Lang Process. 23:530–539, 2014; Peng, et al. Recurrent neural networks with external memory for spoken language understanding. Proceedings of the 2015 Natural Language Processing and Chinese Computing. 9362:25–35, 2015; Kurata, et al. Leveraging sentence-level information with encoder LSTM for semantic slot filling. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2077–2083, 2016; Hahn, et al. IEEE Trans Audio Speech Lang Process. 19:1569–1583, 2011) or with a joint model (Liu and Lane. Attention-based recurrent neural network models for joint intent detection and slot filling. Proceedings of the Interspeech. 685–689, 2016; Hakkani-Tür, et al. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. Proceedings of the Interspeech. 715–719, 2016; Guo, et al. Joint semantic utterance classification and slot filling with recursive neural networks. Proceedings of the 2014 IEEE Spoken Language Technology Workshop. 554–559, 2014). Most previous studies, however, either treat intent detection and slot filling as two separate parallel tasks, or use a sequence-to-sequence model to generate both the semantic tags and the intent.
Most of these approaches use a single (joint) NN-based model (including encoder-decoder structures) to model both tasks, and hence may not fully exploit the cross-impact between them. In this chapter, new Bi-model-based RNN semantic frame parsing network structures are designed to perform intent detection and slot filling jointly, taking the cross-impact of the two tasks on each other into account through two correlated bidirectional LSTMs (BLSTMs). The Bi-model structure with a decoder achieves state-of-the-art results on the benchmark ATIS data (Hemphill, et al. The ATIS spoken language systems pilot corpus. Proceedings of a Workshop Held at Hidden Valley. 96–101, 1990; Tur, et al. What is left to be understood in ATIS? IEEE Spoken Language Technology Workshop. 19–24, 2010), with improvements of about 0.5% in intent accuracy and 0.9% in slot filling.
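To make the idea of "two correlated BLSTMs" concrete, here is a minimal forward-pass sketch in plain Python. It is not the chapter's implementation and rests on several assumptions. One BLSTM encodes the utterance for the intent task and another encodes it for the slot task. Each task then has its own LSTM decoder, which reads the concatenation of its own encoder state and the other network's encoder state at every time step, and this is where the cross-impact enters. The intent is predicted from the last decoder step and one slot tag is predicted per token. The names `BiModel`, `LSTMCell`, and `BLSTM`, the dimensions, and the random untrained weights are all hypothetical. The decoders also do not feed back their previous predictions, a detail the actual model may handle differently.

```python
import math
import random

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector (list).
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class LSTMCell:
    """Standard LSTM cell; the four gates share one stacked weight matrix."""
    def __init__(self, in_dim, hid, rng):
        self.hid = hid
        self.W = rand_matrix(4 * hid, in_dim + hid, rng)  # acts on [x ; h_prev]
        self.b = [0.0] * (4 * hid)

    def step(self, x, h, c):
        z = [a + b for a, b in zip(matvec(self.W, x + h), self.b)]
        H = self.hid
        i = [sigmoid(v) for v in z[0:H]]            # input gate
        f = [sigmoid(v) for v in z[H:2 * H]]        # forget gate
        o = [sigmoid(v) for v in z[2 * H:3 * H]]    # output gate
        g = [math.tanh(v) for v in z[3 * H:4 * H]]  # candidate cell state
        c = [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c, i, g)]
        h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
        return h, c

def run(cell, xs):
    # Unroll a cell over a sequence and return the hidden state at every step.
    h, c = [0.0] * cell.hid, [0.0] * cell.hid
    states = []
    for x in xs:
        h, c = cell.step(x, h, c)
        states.append(h)
    return states

class BLSTM:
    def __init__(self, in_dim, hid, rng):
        self.fw = LSTMCell(in_dim, hid, rng)
        self.bw = LSTMCell(in_dim, hid, rng)

    def encode(self, xs):
        fw = run(self.fw, xs)
        bw = run(self.bw, xs[::-1])[::-1]       # re-align backward states to time order
        return [f + b for f, b in zip(fw, bw)]  # concatenate both directions

class BiModel:
    """Hypothetical Bi-model sketch: two BLSTMs whose decoders read each other's states."""
    def __init__(self, emb_dim, hid, n_intents, n_slots, seed=0):
        rng = random.Random(seed)
        self.enc_intent = BLSTM(emb_dim, hid, rng)
        self.enc_slot = BLSTM(emb_dim, hid, rng)
        # Decoder input is [own BLSTM state ; other BLSTM state], i.e. 4 * hid wide.
        self.dec_intent = LSTMCell(4 * hid, hid, rng)
        self.dec_slot = LSTMCell(4 * hid, hid, rng)
        self.W_intent = rand_matrix(n_intents, hid, rng)
        self.W_slot = rand_matrix(n_slots, hid, rng)

    def forward(self, xs):
        h1 = self.enc_intent.encode(xs)  # intent-network encoder states
        h2 = self.enc_slot.encode(xs)    # slot-network encoder states
        # Cross-impact: each task's decoder consumes both networks' states.
        d1 = run(self.dec_intent, [a + b for a, b in zip(h1, h2)])
        d2 = run(self.dec_slot, [b + a for a, b in zip(h1, h2)])
        intent = softmax(matvec(self.W_intent, d1[-1]))        # one distribution per utterance
        slots = [softmax(matvec(self.W_slot, d)) for d in d2]  # one distribution per token
        return intent, slots

model = BiModel(emb_dim=8, hid=6, n_intents=3, n_slots=5)
rng = random.Random(1)
utterance = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]  # 4 fake token embeddings
intent, slots = model.forward(utterance)
print(len(intent), len(slots), len(slots[0]))  # 3 4 5
```

Keeping the two encoders separate, rather than sharing one, lets each task learn its own representation while still conditioning on the other's. That separation is the contrast the abstract draws with single joint models. Training (losses and optimization) is omitted here.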