CiteNet: Cross-modal incongruity perception network for multimodal sentiment prediction

Jie Wang, Yan Yang, Keyu Liu, Zhuyang Xie, Fan Zhang, Tianrui Li
DOI: https://doi.org/10.1016/j.knosys.2024.111848
IF: 8.139
2024-04-25
Knowledge-Based Systems
Abstract: Multimodal sentiment prediction poses a formidable challenge that necessitates a profound understanding of both visual and linguistic cues, as well as the intricate interactions between them. The current achievements of modern systems in this domain can plausibly be attributed to the development of sophisticated cross-modal fusion techniques. Nevertheless, such solutions often handle each modality equally, neglecting the discordant predictions arising from sentiment incongruity in unimodal sources, which may result in performance degradation in conventional extraction-fusion scenarios. In this work, we take a different route, introducing an extraction-estimation-fusion paradigm aimed at exploring more reliable multimodal representations under the supervision of unimodal sentiment prediction. To this end, we propose a Cross-modal IncongruiTy pErception NETwork, named CiteNet, for multimodal sentiment detection. In CiteNet, we initially develop a cross-modal alignment module tailored to synchronize modality-specific representations through contrastive learning. Subsequently, with a refined cross-modal integration module, CiteNet can achieve a synergistic and comprehensive multimodal representation. In addition, we explore a cross-modal incongruity learning module from an information-theoretic perspective, capable of estimating inherent sentiment disparities by analyzing modal distributions. This incongruity score is then employed as a crucial factor in the adaptive fusion of unimodal and multimodal representations, culminating in enhanced accuracy in sentiment prediction. Experimental results on two datasets demonstrate that CiteNet outperforms prior methods by a significant margin of approximately 1%–11% in accuracy.
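The estimation-then-fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which information-theoretic measure CiteNet uses or how the score enters the fusion. The sketch assumes the Jensen-Shannon divergence between the two unimodal sentiment distributions as the incongruity score, and assumes that a higher score shifts weight from the averaged unimodal representations toward the joint multimodal one. The function names (`js_divergence`, `adaptive_fusion`) and all arguments are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits; lies in [0, 1] for distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2((p + eps) / (m + eps)), axis=-1)
    kl_qm = np.sum(q * np.log2((q + eps) / (m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def adaptive_fusion(text_repr, image_repr, joint_repr, text_logits, image_logits):
    """Blend unimodal and multimodal representations by an incongruity score.

    Assumption (not from the paper): the score is the JS divergence between
    the unimodal sentiment distributions, and it acts as a convex weight.
    """
    score = js_divergence(softmax(text_logits), softmax(image_logits))
    score = np.clip(score, 0.0, 1.0)[..., None]   # shape (batch, 1)
    unimodal = 0.5 * (text_repr + image_repr)     # simple average of modalities
    fused = score * joint_repr + (1.0 - score) * unimodal
    return fused, score.squeeze(-1)
```

Under these assumptions, a sample whose text and image agree on sentiment gets a score near 0 and keeps the averaged unimodal features, while a sample whose modalities disagree sharply gets a score near 1 and relies on the joint representation.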
computer science, artificial intelligence
What problem does this paper attempt to address?