MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

Shezheng Song, Chengxiang He, Shasha Li, Shan Zhao, Chengyu Wang, Tianwei Yan, Xiaopeng Li, Qian Wan, Jun Ma, Jie Yu, Xiaoguang Mao
2024-11-25
Abstract: Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite these advancements, there remains a lack of standardized benchmarks for evaluating MLLMs' performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the lack of a standardized benchmark for evaluating multimodal large language models (MLLMs) on multi-object sentiment analysis. Most existing sentiment analysis datasets focus on single-object samples, so they cannot comprehensively evaluate how well MLLMs handle complex multi-object sentiment analysis. Existing datasets are also limited in image-text consistency, instruction adaptability, and multi-object sentiment evaluation, which makes it difficult to assess the multimodal understanding ability of MLLMs accurately.

To solve these problems, the authors introduce MOSABench, a new benchmark designed specifically to evaluate the performance of MLLMs on multi-object sentiment analysis tasks. The main features and contributions of MOSABench are:

1. **Multi-object sentiment analysis**: MOSABench contains approximately 1,000 images with multiple objects and requires MLLMs to evaluate the sentiment of each object independently, reflecting the complexity of the real world.
2. **Distance labeling**: The spatial relationship between objects in each image is labeled (such as overlap, proximity, and distance), which reveals how object distance relates to sentiment prediction accuracy.
3. **Post-processing evaluation**: Because MLLMs produce outputs in inconsistent formats, a post-processing method standardizes the model output and reduces the impact of formatting on evaluation accuracy.
4. **Improved scoring mechanism**: The sentiment of the two objects in each sample is judged independently, and the sample receives 3 points (completely correct), 1 point (partially correct), or 0 points (completely wrong). This score is combined with traditional metrics such as F1, precision, and recall to provide a more comprehensive evaluation.
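The post-processing and scoring steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the label set (`positive`, `neutral`, `negative`), the keyword-matching normalization, the function names, and the reading of "partially correct" as exactly one of the two objects being right are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Assumed label set; the summary does not list the benchmark's actual labels.
LABELS = ("positive", "neutral", "negative")


def normalize_label(raw: str) -> Optional[str]:
    """Map free-form model output to one canonical label, or None if unclear.

    A simple keyword-based stand-in for the post-processing step.
    """
    text = raw.strip().lower()
    found = [label for label in LABELS if label in text]
    # Accept only outputs that mention exactly one label.
    return found[0] if len(found) == 1 else None


def score_sample(pred: Tuple[str, str], gold: Tuple[str, str]) -> int:
    """Score one two-object sample: 3 if both objects match, 1 if one, else 0."""
    correct = sum(normalize_label(p) == g for p, g in zip(pred, gold))
    return {2: 3, 1: 1, 0: 0}[correct]


if __name__ == "__main__":
    gold = ("positive", "negative")
    print(score_sample(("Positive.", "The sentiment is negative"), gold))
    print(score_sample(("positive", "neutral"), gold))
```

Normalizing before scoring means that a correct answer wrapped in extra text still earns credit, while an ambiguous answer that names several labels earns none.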
Through these innovations, MOSABench fills the gap in evaluation benchmarks for multi-object sentiment analysis and provides a powerful tool for improving the performance of MLLMs in complex multi-object sentiment analysis tasks.

### Key Formulas

- **Distance Calculation Formula**:

\[ d = \sqrt{(C_{1,x} - C_{2,x})^2 + (C_{1,y} - C_{2,y})^2} \]

where \(C_1\) and \(C_2\) are the center-point coordinates of the two bounding boxes.

### Summary

MOSABench aims to address the deficiencies of existing sentiment analysis datasets in multi-object evaluation. By introducing multi-object sentiment analysis, distance labeling, post-processing evaluation, and an improved scoring mechanism, it provides a more scientific and comprehensive evaluation benchmark for MLLMs.
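The center-point distance listed under Key Formulas can be computed as in the following minimal sketch. The `(x_min, y_min, x_max, y_max)` box format and the function names are assumptions for illustration. The summary gives no thresholds for the distance categories, so none are applied here.

```python
import math
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center point (x, y) of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def center_distance(box1: Box, box2: Box) -> float:
    """Euclidean distance d between the centers C1 and C2 of two boxes."""
    c1_x, c1_y = box_center(box1)
    c2_x, c2_y = box_center(box2)
    return math.hypot(c1_x - c2_x, c1_y - c2_y)


if __name__ == "__main__":
    # Centers are (1, 1) and (4, 5), so the distance is 5.0.
    print(center_distance((0, 0, 2, 2), (3, 4, 5, 6)))
```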