InMu-Net: Advancing Multi-modal Intent Detection Via Information Bottleneck and Multi-sensory Processing
Zhihong Zhu, Xuxin Cheng, Zhaorun Chen, Yuyan Chen, Yunyan Zhang, Xian Wu, Yefeng Zheng, Bowen Xing
DOI: https://doi.org/10.1145/3664647.3681623
2024-01-01
Abstract: Multi-modal intent detection (MID) aims to comprehend users' intentions through diverse modalities and has received widespread attention in dialogue systems. Despite promising advances in complex fusion mechanisms and architecture designs, challenges remain due to (1) noise and redundancy in both the visual and audio modalities and (2) long-tailed distributions of intent categories. To tackle these two issues, we propose InMu-Net, a simple yet effective framework for MID from the Information bottleneck and Multi-sensory processing perspective. Our contributions are threefold. First, we devise a denoising bottleneck module to filter out intent-irrelevant information in the fused feature. Second, we introduce a saliency preservation loss to prevent intent-relevant information from being dropped. Third, we introduce kurtosis regulation to maintain representation smoothness during the filtering process, mitigating the adverse impact of the long-tailed distribution. Comprehensive experiments on two MID benchmark datasets demonstrate the effectiveness of InMu-Net and its key components. A series of analyses further reveals its denoising potential and robustness in low-resource, modality corruption, cross-architecture, and cross-task scenarios.
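The abstract names kurtosis regulation but does not give its form. As a purely illustrative sketch, and not the authors' formulation, the snippet below computes the excess kurtosis (fourth standardized moment minus 3) of a feature vector and a hypothetical penalty that pulls it toward a target value. The function names `excess_kurtosis` and `kurtosis_penalty`, the squared-deviation form, and the default target of 0 are all assumptions made for this example.

```python
def excess_kurtosis(values):
    """Excess kurtosis of a sequence: fourth standardized moment minus 3."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Second and fourth central moments (population form, divided by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("variance is zero; kurtosis is undefined")
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


def kurtosis_penalty(features, target=0.0):
    """Hypothetical regularizer (an assumption, not from the paper):
    mean squared deviation of each feature vector's excess kurtosis
    from a target value."""
    deviations = [(excess_kurtosis(f) - target) ** 2 for f in features]
    return sum(deviations) / len(deviations)
```

Under this sketch, a vector dominated by a single large activation has higher excess kurtosis than an evenly spread one, so a penalty of this kind would discourage peaked, heavy-tailed representations. How InMu-Net actually defines and applies the term is not stated in the abstract.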