A Deep Spatiotemporal Interaction Network for Multimodal Sentimental Analysis and Emotion Recognition

Xi-Cheng Li, Feng Zhang, Qiang Hua, Chun-Ru Dong
DOI: https://doi.org/10.1016/j.ins.2024.121515
2025-01-01
Abstract: A central challenge in multimodal sentiment analysis and emotion recognition is how to fuse multimodal inputs effectively. Transformer-based models have recently achieved great success in these tasks. However, owing to their parallel structure, transformer-based models often neglect the temporal coherence of human emotion. Additionally, the low-rank bottleneck created by multi-head attention limits the fitting ability of such models. To tackle these issues, this study proposes a Deep Spatiotemporal Interaction Network (DSIN). It consists of two main components: a cross-modal transformer with a cross-talking attention module, and a hierarchically temporal fusion module. The cross-modal transformer models the spatial interactions between different modalities, while the hierarchically temporal fusion module models the temporal coherence of emotion. DSIN can thus model the spatiotemporal interactions of multimodal inputs by incorporating time dependency into the parallel structure of the transformer, and it reduces the redundancy of embedded features by implanting their spatiotemporal interactions into a hybrid memory network in a hierarchical manner. Experimental results on two benchmark datasets indicate that DSIN achieves superior performance compared with state-of-the-art models, and some useful insights are derived from the results.
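The abstract does not spell out the cross-talking attention module, so the following is only a minimal NumPy sketch of one plausible reading: cross-modal attention (queries from a target modality, keys and values from a source modality) combined with talking-heads-style mixing of attention logits and weights across heads, a known way to relax the per-head low-rank bottleneck. The function name, the parameter names (`W_q`, `P_logits`, and so on), and the random inputs are all hypothetical and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_talking_heads(target, source, params, num_heads):
    """Hypothetical sketch, not the authors' implementation.

    target: (T_t, d) sequence of the modality being enriched (queries).
    source: (T_s, d) sequence of the other modality (keys and values).
    """
    T_t, d = target.shape
    T_s = source.shape[0]
    d_k = d // num_heads

    # Project, then split the feature axis into heads: (heads, time, d_k).
    q = (target @ params["W_q"]).reshape(T_t, num_heads, d_k).transpose(1, 0, 2)
    k = (source @ params["W_k"]).reshape(T_s, num_heads, d_k).transpose(1, 0, 2)
    v = (source @ params["W_v"]).reshape(T_s, num_heads, d_k).transpose(1, 0, 2)

    # Per-head logits have rank at most d_k: the low-rank bottleneck.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, T_t, T_s)

    # Let heads "talk": mix logits across the head axis before the softmax,
    # and mix the attention weights across heads again after it.
    logits = np.einsum("hts,hg->gts", logits, params["P_logits"])
    weights = softmax(logits, axis=-1)
    weights = np.einsum("hts,hg->gts", weights, params["P_weights"])

    out = weights @ v  # (heads, T_t, d_k)
    out = out.transpose(1, 0, 2).reshape(T_t, d)  # merge heads
    return out @ params["W_o"]


# Toy usage with random data; shapes and seeds are arbitrary assumptions.
rng = np.random.default_rng(0)
d, heads = 8, 4
params = {name: rng.normal(scale=0.1, size=(d, d))
          for name in ("W_q", "W_k", "W_v", "W_o")}
params["P_logits"] = rng.normal(size=(heads, heads))
params["P_weights"] = rng.normal(size=(heads, heads))

target = rng.normal(size=(5, d))  # e.g. 5 time steps of one modality
source = rng.normal(size=(7, d))  # e.g. 7 time steps of another modality
out = cross_modal_talking_heads(target, source, params, heads)
```

Setting both mixing matrices to the identity recovers ordinary multi-head cross-attention, which makes the extra cross-head interaction easy to isolate in an ablation. The temporal fusion and hybrid memory components are not sketched here because the abstract gives no detail on their structure.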