MATF: main-auxiliary transformer fusion for multi-modal sentiment analysis

Shao, Xinhui
DOI: https://doi.org/10.1007/s00500-024-09886-7
IF: 3.732
2024-09-21
Soft Computing
Abstract: Videos and vlogs primarily carry three types of information: natural language, facial movements, and audio activity. Together, these can effectively portray the publisher's likely attitude. Most recent studies on multimodal sentiment analysis take one of two approaches. The first examines each modality in isolation, which ignores the interactions between modalities. The second integrates information from several modalities and treats them all equally, which introduces considerable noise and disregards the dominant position of critical modalities (such as text). To address these challenges, this paper proposes a main fusion module based on a mask attention mechanism and a dynamic aggregation attention mechanism, together with an auxiliary fusion module based on a directed cross-modal transformer. The paper also accounts for the data misalignment caused by the different sampling rates of each modal sequence. By using the main and auxiliary fusion mechanisms, the model can handle "unaligned" data, pay greater attention to the crucial modalities and features of sentiment information, and reduce noise input. The model was tested on two publicly accessible multimodal datasets (CMU-MOSI and CMU-MOSEI), and the findings show that it performs significantly better than state-of-the-art models.
computer science, artificial intelligence, interdisciplinary applications
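
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind directed cross-modal attention on unaligned sequences, not the authors' MATF code. It assumes plain scaled dot-product attention with no learned projections, and the function name `cross_modal_attention`, the toy dimensions, and the padding mask are all hypothetical choices made for illustration. The point it shows is that a text sequence can query an audio sequence of a different length, so no word-level alignment between the two is required.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(target, source, source_mask=None):
    """Directed attention from a target modality to a source modality.

    target: list of T_t vectors (queries), each of dimension d
    source: list of T_s vectors (used as both keys and values), dimension d
    source_mask: optional list of T_s booleans; False marks positions
                 (e.g. padding) that must receive no attention weight
    Returns (outputs, weights): T_t fused vectors and the T_t x T_s weights.
    """
    d = len(target[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in target:
        # Similarity of this target step to every source step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in source]
        if source_mask is not None:
            scores = [s if keep else -1e9 for s, keep in zip(scores, source_mask)]
        w = softmax(scores)
        # Weighted sum of source vectors, one fused vector per target step.
        out = [sum(wj * v[i] for wj, v in zip(w, source)) for i in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy, randomly generated features (assumption: both modalities already
# share the same feature dimension d = 4).
rng = random.Random(0)
text = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 text steps
audio = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(7)]  # 7 audio steps
mask = [True] * 5 + [False] * 2  # last two audio frames treated as padding

fused, attn = cross_modal_attention(text, audio, mask)
```

Here `fused` has one vector per text step even though the audio sequence is longer, and masked audio frames receive effectively zero weight. A full model would add learned query, key, and value projections, multiple heads, and the paper's main fusion module, none of which are reproduced here.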