Abstract: Multimodal large language models (MLLMs) have shown remarkable progress in high-level semantic tasks such as visual question answering, image captioning, and emotion recognition. However, despite these advancements, there remains a lack of standardized benchmarks for evaluating MLLM performance in multi-object sentiment analysis, a key task in semantic understanding. To address this gap, we introduce MOSABench, a novel evaluation dataset designed specifically for multi-object sentiment analysis. MOSABench includes approximately 1,000 images with multiple objects, requiring MLLMs to independently assess the sentiment of each object, thereby reflecting real-world complexities. Key innovations in MOSABench include distance-based target annotation, post-processing for evaluation to standardize outputs, and an improved scoring mechanism. Our experiments reveal notable limitations in current MLLMs: while some models, like mPLUG-owl and Qwen-VL2, demonstrate effective attention to sentiment-relevant features, others exhibit scattered focus and performance declines, especially as the spatial distance between objects increases. This research underscores the need for MLLMs to enhance accuracy in complex, multi-object sentiment analysis tasks and establishes MOSABench as a foundational tool for advancing sentiment analysis capabilities in MLLMs.

What problem does this paper attempt to address?

This paper addresses the lack of a standardized benchmark for evaluating multimodal large language models (MLLMs) on multi-object sentiment analysis. Most existing sentiment analysis datasets focus on single-object samples and therefore cannot comprehensively evaluate how well MLLMs handle complex multi-object sentiment analysis. Existing datasets also have limitations in image-text consistency, instruction adaptability, and multi-object sentiment evaluation, which makes it difficult to accurately assess the multimodal understanding ability of MLLMs. To solve these problems, the authors introduce MOSABench, a new benchmark specifically designed to evaluate the performance of MLLMs in multi-object sentiment analysis tasks. The main features and contributions of MOSABench include:

1. **Multi-object sentiment analysis**: MOSABench contains approximately 1,000 images with multiple objects and requires MLLMs to independently evaluate the sentiment of each object, reflecting the complexity of the real world.
2. **Distance labeling**: Each image is labeled with the spatial distance between its objects (for example, overlapping, nearby, or far apart), which reveals the relationship between object distance and sentiment prediction accuracy.
3. **Post-processing evaluation**: To address the inconsistent output formats of MLLMs, a post-processing method standardizes the model output and reduces the impact of formatting on evaluation accuracy.
4. **Improved scoring mechanism**: A new scoring mechanism makes an independent sentiment judgment for each of the two objects in a sample and assigns 3 points (completely correct), 1 point (partially correct), or 0 points (completely wrong). It is combined with traditional metrics such as F1, precision, and recall to provide a more comprehensive evaluation.
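The 3/1/0 scoring rule in item 4 can be illustrated with a minimal Python sketch. It assumes each sample carries exactly two objects and that sentiment labels are plain strings compared case-insensitively; the function names `score_sample` and `total_score` are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # sentiment labels for the two objects in one sample


def score_sample(pred: Pair, gold: Pair) -> int:
    """Score one sample: 3 if both objects are correct, 1 if one is, 0 if none."""
    # Count how many of the two per-object predictions match the gold labels.
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(pred, gold)
    )
    # Map the number of correct objects (0, 1, 2) to the score (0, 1, 3).
    return {0: 0, 1: 1, 2: 3}[correct]


def total_score(preds: Iterable[Pair], golds: Iterable[Pair]) -> int:
    """Sum the per-sample scores over a dataset (assumed aggregation)."""
    return sum(score_sample(p, g) for p, g in zip(preds, golds))
```

Under these assumptions, a sample with one correct and one incorrect object earns 1 point rather than half of the full score, so fully correct samples are weighted more heavily than two partially correct ones.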
Through these innovations, MOSABench fills the gap in evaluation benchmarks for multi-object sentiment analysis and provides a powerful tool for improving the performance of MLLMs in complex multi-object sentiment analysis tasks.

### Key Formulas

- **Distance Calculation Formula**:

\[ d = \sqrt{(C_{1,x} - C_{2,x})^2 + (C_{1,y} - C_{2,y})^2} \]

where \(C_1\) and \(C_2\) are the center points of the two bounding boxes, and the subscripts \(x\) and \(y\) denote their coordinates.

### Summary

MOSABench aims to address the deficiencies of existing sentiment analysis datasets in multi-object evaluation. By introducing multi-object sentiment analysis, distance labeling, post-processing evaluation, and an improved scoring mechanism, it provides a more rigorous and comprehensive evaluation benchmark for MLLMs.
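The center-point distance from the Key Formulas section can be computed as in the following minimal Python sketch. The bounding-box layout `(x_min, y_min, x_max, y_max)` and the helper names `box_center` and `center_distance` are assumptions made for illustration; the source specifies neither the box format nor any thresholds for turning the distance into a category.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def box_center(box: Box) -> Tuple[float, float]:
    """Return the center point (x, y) of an axis-aligned bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def center_distance(box1: Box, box2: Box) -> float:
    """Euclidean distance d between the centers of two bounding boxes."""
    c1_x, c1_y = box_center(box1)
    c2_x, c2_y = box_center(box2)
    # math.hypot(a, b) computes sqrt(a**2 + b**2), matching the formula for d.
    return math.hypot(c1_x - c2_x, c1_y - c2_y)


# Centers are (1, 1) and (4, 5), so the distance is sqrt(3**2 + 4**2) = 5.0.
print(center_distance((0, 0, 2, 2), (3, 4, 5, 6)))
```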

MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image
