A transformer-encoder-based multimodal multi-attention fusion network for sentiment analysis

Cong Liu, Yong Wang, Jing Yang
DOI: https://doi.org/10.1007/s10489-024-05623-7
IF: 5.3
2024-06-28
Applied Intelligence
Abstract: Feature fusion for multimodal sentiment analysis is a challenging but worthwhile research topic. As the time dimension extends, multimodal signals interact with one another, and the lack of control over the target modal representations during fusion leads to erroneous shifts of vectors in the feature space. Moreover, ignoring how the target modal features are represented under different fusion orders may lead to insufficient fusion of complementary information. To address these issues, this paper proposes a transformer-encoder-based multimodal multi-attention fusion network model. The model constructs a multi-attention fusion transformer-encoder to learn inter-modal consistent features and enhance the representation of target modal features. Meanwhile, for each target modality, the model constructs multi-attention fusion transformer-encoders with different fusion orders to capture the complementary features among the sequences with different fusion orders. Then, the three target modal representations, which contain both consistent and complementary features, are fused with the initial features through residual connections to guide the final sentiment analysis. We conduct extensive experiments on three public multimodal datasets. The results show that our approach outperforms the compared multimodal sentiment analysis methods on most metrics and can explain the contributions of inter- and intra-modal interactions in multiple modalities.
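The idea of fusing source modalities into a target modality under different fusion orders, then adding the result back to the initial features through a residual connection, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture in detail, so the single-head attention without learned projections, the function names (`cross_attention`, `fuse_in_order`, `fuse_target`), and the summation used to combine the two orders are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(target, source):
    # Scaled dot-product attention (assumed form, no learned weights):
    # the target sequence supplies the queries, the source sequence
    # supplies both keys and values.
    d = target.shape[-1]
    scores = target @ source.T / np.sqrt(d)    # (T_target, T_source)
    return softmax(scores, axis=-1) @ source   # (T_target, d)


def fuse_in_order(target, sources):
    # Fuse each source into the target representation one at a time,
    # in the given order, keeping a residual path at every step.
    h = target
    for s in sources:
        h = h + cross_attention(h, s)
    return h


def fuse_target(target, src_a, src_b):
    # Hypothetical combination: run both fusion orders, sum them, and
    # add the initial target features back through a residual connection.
    order_ab = fuse_in_order(target, [src_a, src_b])
    order_ba = fuse_in_order(target, [src_b, src_a])
    return target + order_ab + order_ba


# Toy sequences: (time steps, feature dim). Lengths may differ per modality.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))
audio = rng.normal(size=(7, 8))
video = rng.normal(size=(6, 8))

text_repr = fuse_target(text, audio, video)
print(text_repr.shape)
```

In this sketch the two fusion orders generally yield different representations because each attention step is conditioned on the output of the previous one, which is the property the abstract appeals to when it speaks of complementary features across fusion orders. A full model would repeat this for each of the three target modalities and use learned projections, multiple heads, and feed-forward layers as in a standard transformer encoder.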
computer science, artificial intelligence
What problem does this paper attempt to address?