Contextual Augmented Global Contrast for Multimodal Intent Recognition

Kaili Sun, Zhiwen Xie, Mang Ye, Huyin Zhang
DOI: https://doi.org/10.1109/cvpr52733.2024.02546
2024-01-01
Abstract: Multimodal intent recognition (MIR) aims to perceive human intent polarity via language, visual, and acoustic modalities. Inherent intent ambiguity makes recognition challenging in multimodal scenarios. Existing MIR methods tend to model each video independently, ignoring global contextual information across videos. This learning manner inevitably introduces perception biases, which are exacerbated by inconsistencies in the multimodal representation and amplify intent uncertainty. This challenge motivates us to explore effective global context modeling. We therefore propose a context-augmented global contrast (CAGC) method that captures rich global context features by mining both intra- and cross-video context interactions for MIR. Concretely, we design a context-augmented transformer module to extract global context dependencies across videos. To further alleviate error accumulation and interference, we develop a cross-video bank that retrieves effective video sources by considering both intentional tendency and video similarity. Furthermore, we introduce a global context-guided contrastive learning scheme, designed to mitigate the inconsistencies that arise because the global context and the individual modalities lie in different feature spaces. This scheme uses global cues as supervision to learn a robust multimodal intent representation. Experiments demonstrate that CAGC outperforms state-of-the-art MIR methods. We also generalize our approach to a closely related task, multimodal sentiment analysis, where it achieves comparable performance.
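The abstract names two mechanisms without giving their formulation: retrieval from a cross-video bank using intent tendency and video similarity, and a contrastive objective supervised by global context. The sketch below is only an illustration of how such pieces are commonly built, not the paper's implementation. It assumes that "intentional tendency" can be approximated by matching a (predicted) intent label, that "video similarity" is cosine similarity, and that the contrastive scheme resembles a standard InfoNCE loss with the global context feature as the anchor. All function names, the bank layout, and the example labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_context(query, query_intent, bank, k=2):
    """Pick up to k bank entries for a query video (assumed retrieval rule).

    bank is a list of (feature, intent_label) pairs. Entries are first
    filtered by intent label (a stand-in for "intentional tendency"),
    then ranked by cosine similarity to the query feature.
    """
    scored = [
        (cosine(query, feat), idx)
        for idx, (feat, intent) in enumerate(bank)
        if intent == query_intent
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (assumed form, not taken from the paper).

    anchor: global context feature; positive: a modality feature of the
    same video; negatives: modality features of other videos.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]  # negative log-softmax of the positive


# Toy usage with 2-D features and made-up intent labels.
bank = [
    ([1.0, 0.0], "complain"),
    ([0.9, 0.1], "complain"),
    ([0.0, 1.0], "praise"),
    ([0.5, 0.5], "complain"),
]
neighbors = retrieve_context([1.0, 0.05], "complain", bank, k=2)
aligned = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

In this toy setup the retrieval keeps only same-intent videos and prefers the most similar ones, and the loss is smaller when the modality feature agrees with the global context than when it does not, which is the behavior a context-guided contrastive objective is meant to encourage.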