Prompt Learning for Multimodal Intent Recognition with Modal Alignment Perception

Yuzhao Chen, Wenhua Zhu, Weilun Yu, Hongfei Xue, Hao Fu, Jiali Lin, Dazhi Jiang
DOI: https://doi.org/10.1007/s12559-024-10328-7
IF: 4.89
2024-01-01
Cognitive Computation
Abstract: Multimodal intent recognition analysis is a crucial task in understanding user intent through speech, body movements, tone, and other modalities in real-world multimodal environments. However, due to the hidden nature of intent within and across modalities, most existing methods still have limitations in excavating and integrating multimodal intent information. This paper introduces a prompt learning with modal alignment perception (PMAP) approach to address these challenges. First, for excavating deep-level semantic information, the intent templates are constructed for prompt learning to enhance text representations. Then, cross-modal alignment perception is leveraged to eliminate modality discrepancies while excavating consistent hidden intent information from non-text modalities. Through multimodal semantic interaction, the position of text in the semantic space is fine-tuned, which effectively aggregates intent details from multiple modalities. Extensive experiments demonstrate that our method achieves significant improvements.
What problem does this paper attempt to address?