Abstract: Recently, Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars. However, current multimodal models have reached a performance bottleneck. To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets to answer the following questions: Q1: Are the modalities equally important for TMSC? Q2: Which multimodal fusion modules are more effective? Q3: Do existing datasets adequately support the research? Our experiments and analyses reveal that current TMSC systems rely primarily on the textual modality, as most targets' sentiments can be determined from text alone. Consequently, we point out several directions to work on for the TMSC task in terms of model design and dataset construction. The code and data can be found at https://github.com/Junjie-Ye/RethinkingTMSC.

What problem does this paper attempt to address?

This paper addresses the performance bottleneck that current multimodal models have reached on the Target-oriented Multimodal Sentiment Classification (TMSC) task. Through extensive empirical evaluation and in-depth dataset analysis, the authors explore three core questions:

1. **Importance of different modalities**: Are all modalities equally important for the TMSC task?
2. **Effectiveness of multimodal fusion modules**: Which multimodal fusion modules are more effective?
3. **Adequacy of existing datasets**: Do the existing datasets adequately support TMSC research?

### Main contributions

1. **Effectiveness of model structure**: The paper studies how different unimodal encoders and multimodal fusion modules affect the TMSC task.
2. **Limitations of existing datasets**: The paper analyzes in depth the limitations of the widely used datasets (such as Twitter15 and Twitter17).
3. **Future research directions**: The paper offers observations and suggestions for future TMSC model design and dataset construction.

### Experimental results

1. **Importance of text modality**: Text-only models (such as BERT) perform well, while vision-only models (such as ResNet, ViT, Faster R-CNN) perform poorly, indicating that in these datasets text carries more of the sentiment signal than images.
2. **Impact of fusion methods**: The choice of multimodal fusion method has a significant impact on model performance. In particular, fusion modules that mainly draw on text information (such as Image2Text) perform better, which again shows that text and image are not equally important.
3. **Limitations of multimodal models**: Compared with text-only models, the various multimodal fusion modules bring no significant performance improvement, and some even perform worse. This is because some images provide no relevant information and instead introduce noise.
4. **Impact of image encoders**: Performance differences among image encoders are small, and the encoders perform similarly in the multimodal fusion setting. This may be due to the characteristics of the visual data in the existing datasets.

### Dataset analysis

1. **Sample size and label distribution**: The datasets are small, and the average number of targets per sample is less than 1.5. In addition, the sentiment label distribution is unbalanced, with neutral sentiment accounting for about 50% and negative sentiment for less than 15%.
2. **Consistency of multimodal sentiment**: Multimodal sentiment is highly consistent with text sentiment but much less consistent with visual sentiment. For example, in Twitter15, 93% of the targets have the same sentiment in the text and multimodal annotations, while only 47.5% have the same sentiment in the visual and multimodal annotations.
3. **Presence of targets in images**: A large number of targets do not appear in the images at all, which makes those samples a poor fit for target-oriented multimodal sentiment classification.
4. **Jointly determined sentiment**: Only a small share of the data has sentiment that is jointly determined by text and image. For example, in Twitter15, only 22% of the data requires considering both text and image for sentiment classification.

### Conclusions and future work

1. **Model design**: Make full use of the strengths of text information, design more effective image encoding methods, and make the fusion module more robust to noise.
2. **Dataset construction**: The paper proposes the characteristics that a high-quality TMSC dataset should have, including a data distribution that reflects the real world, data diversity, and multi-dimensional annotation information.

Through these studies, the authors hope to provide valuable insights and directions for future TMSC research.
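The label-distribution and consistency figures in the dataset analysis above can be illustrated with a minimal Python sketch. It assumes each target has three separate annotations (text-only, image-only, and multimodal), which is how the summary describes the comparison. The function names `label_distribution` and `agreement_rate` and the toy labels are hypothetical and are not taken from the paper's repository or from Twitter15/Twitter17.

```python
from collections import Counter


def label_distribution(labels):
    """Return each sentiment label's share of all targets."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def agreement_rate(reference, other):
    """Return the fraction of targets whose two annotations are identical."""
    if len(reference) != len(other):
        raise ValueError("annotation lists must have the same length")
    if not reference:
        raise ValueError("annotation lists must not be empty")
    matches = sum(a == b for a, b in zip(reference, other))
    return matches / len(reference)


# Hypothetical toy annotations for eight targets (illustration only).
multimodal = ["neutral", "positive", "neutral", "negative",
              "positive", "neutral", "neutral", "positive"]
text_only = ["neutral", "positive", "neutral", "negative",
             "positive", "neutral", "positive", "positive"]
image_only = ["positive", "positive", "neutral", "neutral",
              "neutral", "neutral", "positive", "negative"]

print("label shares:", label_distribution(multimodal))
print("text vs. multimodal agreement:", agreement_rate(multimodal, text_only))
print("image vs. multimodal agreement:", agreement_rate(multimodal, image_only))
```

On this toy data the text-only label matches the multimodal label for 7 of 8 targets, while the image-only label matches for 3 of 8. A gap of this kind is what the 93% versus 47.5% figures reported for Twitter15 describe.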

RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal Sentiment Classification
