Abstract: Rapid advances in multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to assess the capabilities of MLLMs more accurately. However, the higher-order perceptual capabilities of MLLMs remain largely unexplored. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate a model's higher-order perception of images. Extensive experiments on II-Bench across multiple MLLMs yield three main findings. First, there is a substantial gap between the performance of MLLMs and humans on II-Bench: the best MLLM reaches 74.8% accuracy, whereas human accuracy averages 90% and peaks at 98%. Second, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, most models achieve higher accuracy when image sentiment polarity hints are incorporated into the prompts, which points to a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the lack of evaluation of the advanced perceptual abilities of multimodal large language models (MLLMs), specifically their understanding of implicit meanings in images. It proposes a benchmark named **II-Bench** to evaluate MLLMs' advanced perception, reasoning, and understanding capabilities when dealing with complex images.

#### Background and Motivation

1. **Existing Challenges**: Although current multimodal large language models have made significant progress on various tasks, evaluation of their advanced perceptual abilities (such as emotion understanding and deep meaning extraction) remains insufficient.
2. **Importance of Implicit Meanings in Images**: Images are not just collections of visual information; they often carry human emotions and cultural context. Understanding these implicit meanings requires models to have advanced perceptual abilities.
3. **Limitations of Current Benchmarks**: Existing multimodal benchmarks mainly focus on simple image understanding and knowledge Q&A, and lack a comprehensive evaluation of advanced perceptual abilities.

#### Solution

1. **II-Bench Benchmark**: The paper proposes a new benchmark, **II-Bench**, specifically designed to evaluate how well MLLMs understand implicit meanings in images.
2. **Dataset Composition**: II-Bench includes 1,222 images covering six domains (life, art, society, psychology, environment, and others) and seven image types (illustrations, memes, posters, multi-panel comics, single-panel comics, signs, and paintings).
3. **Evaluation Method**: Each image is paired with 1 to 3 multiple-choice questions, for a total of 1,434 questions, to assess the model's understanding.

#### Key Findings

1. **Performance Gap**: There is a significant gap between MLLM and human performance on II-Bench. The best model achieved 74.8% accuracy, while human accuracy averaged 90% and peaked at 98%.
2. **Domain Performance Differences**: Models performed worse in domains containing abstract and complex information (such as art and psychology) and better in domains such as environment, life, and society.
3. **Impact of Emotional Cues**: Adding the image's sentiment polarity to the prompt improves accuracy for most models, indicating a deficiency in the models' own emotional understanding.

#### Future Prospects

1. **Promoting Research**: II-Bench aims to inspire the community to develop the next generation of MLLMs, advancing towards expert artificial general intelligence (AGI).
2. **Improving Models**: Analyzing II-Bench evaluation results can expose model deficiencies and guide researchers towards more powerful multimodal models.

In summary, this paper fills a gap in the evaluation of advanced perceptual abilities in multimodal large language models by proposing the II-Bench benchmark, providing a tool and direction for future research.
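The evaluation setup summarized above (multiple-choice questions, accuracy broken down by domain, and an optional sentiment polarity hint in the prompt) could be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation code: the record fields (`domain`, `question`, `options`, `answer`, `sentiment`), the toy samples, and the hint wording are all assumptions, and the real II-Bench schema and prompts may differ. The image itself is omitted; an actual run would pass it to the model alongside the prompt.

```python
from collections import defaultdict

# Hypothetical toy records; field names and contents are illustrative only.
SAMPLES = [
    {"domain": "art", "question": "What does the image imply?",
     "options": ["A. Joy", "B. Isolation", "C. Wealth"],
     "answer": "B", "sentiment": "negative"},
    {"domain": "life", "question": "What is the metaphor in the comic?",
     "options": ["A. Time is limited", "B. Food is scarce", "C. Work is fun"],
     "answer": "A", "sentiment": "neutral"},
    {"domain": "art", "question": "What attitude does the poster express?",
     "options": ["A. Criticism", "B. Praise", "C. Indifference"],
     "answer": "A", "sentiment": "negative"},
]


def build_prompt(sample, use_sentiment_hint=False):
    """Build a multiple-choice prompt, optionally prefixed with a sentiment hint."""
    lines = []
    if use_sentiment_hint:
        # Assumed hint wording; the benchmark's actual phrasing is not shown here.
        lines.append(f"Hint: the sentiment of this image is {sample['sentiment']}.")
    lines.append(sample["question"])
    lines.extend(sample["options"])
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def accuracy_by_domain(samples, predictions):
    """Return (overall accuracy, {domain: accuracy}) for predicted option letters."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        total[sample["domain"]] += 1
        # Compare letters case-insensitively, ignoring surrounding whitespace.
        if pred.strip().upper() == sample["answer"]:
            correct[sample["domain"]] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_domain
```

Running `accuracy_by_domain` once on predictions obtained with `use_sentiment_hint=False` and once with `use_sentiment_hint=True` would give the kind of with-hint versus without-hint comparison described in the key findings.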

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Can MLLMs Understand the Deep Implication Behind Chinese Images?

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

A Survey on Benchmarks of Multimodal Large Language Models

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

MileBench: Benchmarking MLLMs in Long Context

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

MMBench: Is Your Multi-modal Model an All-around Player?

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Needle In A Multimodal Haystack

MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models