II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, Shiwen Ni
2024-06-11
Abstract: The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the lack of benchmarks that evaluate the higher-order perceptual abilities of multimodal large language models (MLLMs), in particular their understanding of the implicit meanings in images. Specifically, the paper proposes a benchmark named **II-Bench** to evaluate MLLMs' higher-order perception, reasoning, and understanding capabilities when dealing with complex images.

#### Background and Motivation

1. **Existing Challenges**: Although current multimodal large language models have made significant progress on various tasks, their higher-order perceptual abilities (such as emotion understanding and deep meaning extraction) remain insufficiently evaluated.
2. **Importance of Implicit Meanings in Images**: Images are not just collections of visual information; they often carry human emotions and cultural context. Understanding these implicit meanings requires models to have higher-order perceptual abilities.
3. **Limitations of Current Benchmarks**: Existing multimodal benchmarks mainly focus on simple image understanding and knowledge Q&A, and lack a comprehensive evaluation of higher-order perceptual abilities.

#### Solution

1. **II-Bench Benchmark**: The paper proposes a new benchmark, **II-Bench**, specifically designed to evaluate how well MLLMs understand implicit meanings in images.
2. **Dataset Composition**: II-Bench includes 1,222 images covering six domains (life, art, society, psychology, environment, and others) and various image types (illustrations, memes, posters, multi-panel comics, single-panel comics, signs, and paintings).
3. **Evaluation Method**: Each image is paired with 1 to 3 multiple-choice questions, for a total of 1,434 questions, which are used to assess the model's understanding.

#### Key Findings

1. **Performance Gap**: There is a significant gap between MLLM and human performance on II-Bench. The best-performing model reached 74.8% accuracy, while human accuracy averaged 90% and peaked at 98%.
2. **Domain Performance Differences**: Models performed worse in domains containing abstract and complex information (such as art and psychology) and better in domains such as environment, life, and society.
3. **Impact of Emotional Cues**: Adding the image's sentiment polarity to the prompt improves accuracy for most models, indicating a deficiency in the models' own understanding of image sentiment.

#### Future Prospects

1. **Promoting Research**: II-Bench aims to inspire the community to develop the next generation of MLLMs, advancing towards expert artificial general intelligence (AGI).
2. **Improving Models**: Analyzing II-Bench evaluation results can expose model deficiencies and guide researchers towards more capable multimodal models.

In summary, this paper fills a gap in the evaluation of higher-order perceptual abilities in multimodal large language models by proposing the II-Bench benchmark, providing a tool and a direction for future research.
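
The multiple-choice evaluation and the sentiment-hint experiment summarized above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the `Question` fields, the prompt wording, and the helpers `build_prompt` and `accuracy_by_domain` are all hypothetical names chosen for illustration, and the model call is left out entirely (predictions are assumed to arrive as option letters).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Question:
    """One multiple-choice item. Field names are hypothetical, not the dataset schema."""
    domain: str                      # e.g. "art", "life", "psychology"
    question: str
    options: List[str]
    answer: str                      # gold option letter, e.g. "B"
    sentiment: Optional[str] = None  # e.g. "positive", "neutral", "negative"


def build_prompt(q: Question, use_sentiment_hint: bool = False) -> str:
    """Format the text part of a prompt; the wording is illustrative only."""
    lines = []
    # Optionally prepend the sentiment polarity hint, when one is available.
    if use_sentiment_hint and q.sentiment:
        lines.append(f"Hint: the sentiment of this image is {q.sentiment}.")
    lines.append(q.question)
    for letter, option in zip("ABCDEFGH", q.options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def accuracy_by_domain(questions: List[Question],
                       predictions: List[str]) -> Dict[str, float]:
    """Return accuracy per domain plus an "overall" entry."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for q, pred in zip(questions, predictions):
        hit = pred.strip().upper() == q.answer
        for key in (q.domain, "overall"):
            total[key] += 1
            correct[key] += int(hit)
    return {key: correct[key] / total[key] for key in total}
```

Running the same questions once with `use_sentiment_hint=False` and once with `use_sentiment_hint=True`, then comparing the two accuracy dictionaries, mirrors the kind of comparison behind the sentiment-hint finding; breaking accuracy down by domain mirrors the domain comparison.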