Abstract: As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs on higher-order perception and understanding of Chinese visual content. To fill this gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out from existing benchmarks in several ways. First, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect a model's understanding of Chinese traditional culture. Extensive experiments on CII-Bench across multiple MLLMs yield three main findings. First, a substantial gap is observed between the performance of MLLMs and humans on CII-Bench: the highest MLLM accuracy is 64.4%, whereas human accuracy averages 78.2% and peaks at 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and a lack of deep knowledge of Chinese traditional culture. Finally, most models exhibit higher accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI).
Our project is publicly available at https://cii-bench.github.io/.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to fill a gap in existing research: evaluating the ability of Multimodal Large Language Models (MLLMs) to understand the deeper meaning of Chinese images. Specifically, the paper proposes a benchmark named **CII-Bench** for evaluating the high-level perception, reasoning, and understanding abilities of MLLMs when processing Chinese images.

#### Background and motivation

1. **Development of Multimodal Large Language Models**:
   - In recent years, Multimodal Large Language Models have made remarkable progress in fields such as natural language processing and computer vision.
   - These models can not only process and generate text, but also integrate and interpret information in multiple modalities, such as images, sound, and video.
2. **Need for evaluation of high-level perception ability**:
   - Although significant progress has been made in image recognition and generation tasks, a key research question remains: can these models truly understand and interpret images that carry deeper meanings?
   - Previous work, such as the II-Bench benchmark, has mainly focused on understanding the implications of English images; similar evaluations for Chinese images are lacking.
3. **Characteristics of Chinese images**:
   - Chinese images often contain richer scenes and deeper meanings, especially those related to Chinese traditional culture, such as famous Chinese traditional landscape paintings.
   - These images not only depict natural scenery, but also convey profound philosophical concepts, such as the harmonious coexistence of humans and nature.

#### Main contributions

1. **Proposing the CII-Bench benchmark**:
   - CII-Bench is the first benchmark specifically designed to evaluate the ability of MLLMs to understand the implications of Chinese images.
   - The benchmark includes 698 images and 800 multiple-choice questions, covering six domains: life, art, society, politics, environment, and Chinese traditional culture.
2. **Designing a comprehensive evaluation metric**:
   - An evaluation metric based on GPT-4o is designed; it aligns more closely with human annotation and is better suited to evaluating the understanding of Chinese traditional paintings.
3. **Experimental results**:
   - The experimental results show a significant performance gap between MLLMs and humans. The highest MLLM accuracy is 64.4%, while human accuracy averages 78.2% and peaks at 81.0%.
   - MLLMs perform poorly on images of Chinese traditional culture, indicating that current models fall short in understanding high-level semantics and in Chinese cultural knowledge.
   - Adding image emotion hints to the prompt usually improves a model's score, which also points to the models' limitations in emotion understanding.

#### Conclusion

With CII-Bench, researchers can more clearly understand the potential of MLLMs in cross-cultural settings and push Multimodal Large Language Models towards expert-level artificial general intelligence (AGI).
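
The multiple-choice setup and the emotion-hint finding summarized above can be illustrated with a small sketch. It is not the authors' evaluation code: the `Question` schema, the prompt wording, and the function names `build_prompt` and `accuracy_by_domain` are all assumptions made for illustration. It only shows how per-domain accuracy might be computed and how an emotion hint could be prepended to a prompt.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Question:
    """One multiple-choice item (hypothetical schema, not the official CII-Bench format)."""
    domain: str                    # e.g. "life", "art", "chinese_traditional_culture"
    text: str                      # question shown alongside the image
    options: list[str]             # candidate answers
    answer: str                    # gold option letter, e.g. "B"
    emotion: Optional[str] = None  # optional emotion label for the image


def build_prompt(q: Question, use_emotion_hint: bool = False) -> str:
    """Format a question as text; optionally prepend an emotion hint (assumed wording)."""
    letters = "ABCDEFGH"
    lines = []
    if use_emotion_hint and q.emotion:
        lines.append(f"Hint: the emotion of the image is {q.emotion}.")
    lines.append(q.text)
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(q.options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def accuracy_by_domain(questions: list[Question], predictions: list[str]) -> dict[str, float]:
    """Fraction of predicted letters that match the gold answer, grouped by domain."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, pred in zip(questions, predictions):
        total[q.domain] += 1
        correct[q.domain] += int(pred.strip().upper() == q.answer)
    return {domain: correct[domain] / total[domain] for domain in total}
```

Running the same questions once with `use_emotion_hint=False` and once with `use_emotion_hint=True`, then comparing the two accuracy dictionaries, would mirror the kind of comparison the summary describes.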

Can MLLMs Understand the Deep Implication Behind Chinese Images?

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

A Survey on Benchmarks of Multimodal Large Language Models

Needle In A Multimodal Haystack

MileBench: Benchmarking MLLMs in Long Context

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

CMMLU: Measuring massive multitask language understanding in Chinese

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

LIME: Less Is More for MLLM Evaluation

Explore the Hallucination on Low-level Perception for MLLMs

MULTI: Multimodal Understanding Leaderboard with Text and Images

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark