Can MLLMs Understand the Deep Implication Behind Chinese Images?

Chenhao Zhang, Xi Feng, Yuelin Bai, Xinrun Du, Jinchang Hou, Kaixin Deng, Guangzeng Han, Qinrui Li, Bingli Wang, Jiaheng Liu, Xingwei Qu, Yifei Zhang, Qixuan Zhao, Yiming Liang, Ziqiang Liu, Feiteng Fang, Min Yang, Wenhao Huang, Chenghua Lin, Ge Zhang, Shiwen Ni
2024-10-18
Abstract: As the capabilities of Multimodal Large Language Models (MLLMs) continue to improve, the need for higher-order capability evaluation of MLLMs is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To fill the gap, we introduce the **C**hinese **I**mage **I**mplication understanding **Bench**mark, **CII-Bench**, which aims to assess the higher-order perception and understanding capabilities of MLLMs for Chinese images. CII-Bench stands out in several ways compared to existing benchmarks. Firstly, to ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model's understanding of Chinese traditional culture. Through extensive experiments on CII-Bench across multiple MLLMs, we have made significant findings. First, a substantial gap is observed between the performance of MLLMs and humans on CII-Bench. The highest MLLM accuracy reaches 64.4%, whereas human accuracy averages 78.2% and peaks at 81.0%. Second, MLLMs perform worse on Chinese traditional culture images, suggesting limitations in their ability to understand high-level semantics and a lack of deep knowledge of Chinese traditional culture. Finally, most models exhibit higher accuracy when image emotion hints are incorporated into the prompts. We believe that CII-Bench will enable MLLMs to gain a better understanding of Chinese semantics and Chinese-specific images, advancing the journey towards expert artificial general intelligence (AGI).
Our project is publicly available at https://cii-bench.github.io/.
Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Computers and Society
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to fill a gap in existing research: evaluating the ability of Multimodal Large Language Models (MLLMs) to understand the deeper meaning of Chinese images. Specifically, it proposes a benchmark named **CII-Bench** for evaluating the high-level perception, reasoning, and understanding abilities of MLLMs when processing Chinese images.

#### Background and motivation

1. **Development of Multimodal Large Language Models**:
   - In recent years, Multimodal Large Language Models have made remarkable progress in fields such as natural language processing and computer vision.
   - These models can not only process and generate text, but also integrate and interpret information in multiple modalities, such as images, sound, and video.
2. **Need for evaluation of high-level perception ability**:
   - Although significant progress has been made in image recognition and generation tasks, a key research question remains: can these models truly understand and interpret images that carry deeper meanings?
   - Previous work, such as the II-Bench benchmark, has mainly focused on understanding the implications of English images; comparable evaluations for Chinese images are lacking.
3. **Characteristics of Chinese images**:
   - Chinese images often contain richer scenes and deeper meanings, especially those related to Chinese traditional culture, such as famous Chinese traditional landscape paintings.
   - These images not only depict natural scenery, but also convey profound philosophical concepts, such as the harmonious coexistence of humans and nature.

#### Main contributions

1. **Proposing the CII-Bench benchmark**:
   - CII-Bench is the first benchmark specifically designed to evaluate the ability of MLLMs to understand the implications of Chinese images.
   - The benchmark includes 698 images and 800 multiple-choice questions, covering six domains: life, art, society, politics, environment, and Chinese traditional culture.
2. **Designing comprehensive evaluation metrics**:
   - A GPT-4o-based evaluation metric is designed that aligns more closely with human annotation and is better suited to evaluating the understanding of Chinese traditional paintings.
3. **Experimental results**:
   - The experiments show a significant performance gap between MLLMs and humans. The highest MLLM accuracy is 64.4%, while human accuracy averages 78.2% and peaks at 81.0%.
   - MLLMs perform poorly on images of Chinese traditional culture, indicating that current models fall short in understanding high-level semantics and in knowledge of Chinese culture.
   - Adding image emotion hints to the prompt usually improves a model's score, which also exposes the models' limitations in emotion understanding.

#### Conclusion

With CII-Bench, researchers can more clearly assess the potential of MLLMs in cross-cultural settings and help advance Multimodal Large Language Models towards expert-level artificial general intelligence (AGI).
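
The accuracy figures quoted above are multiple-choice accuracies. As a rough illustration only, the following Python sketch shows one way per-domain and overall accuracy could be computed from model answers. The record layout, the function name `accuracy_by_domain`, and the toy data are all assumptions made for this example; they are not taken from the CII-Bench code or data.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Compute per-domain and overall multiple-choice accuracy.

    `records` is an iterable of (domain, predicted_option, gold_option)
    tuples. This layout is an assumption for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for domain, predicted, gold in records:
        total[domain] += 1
        # Compare option letters case-insensitively, ignoring whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[domain] += 1
    per_domain = {d: correct[d] / total[d] for d in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_domain, overall


# Hypothetical toy records, not data from the paper.
records = [
    ("life", "A", "A"),
    ("life", "B", "C"),
    ("art", "d", "D"),
    ("chinese_traditional_culture", "A", "B"),
]
per_domain, overall = accuracy_by_domain(records)
print(per_domain, overall)
```

Reporting accuracy per domain rather than only overall makes it possible to see gaps such as the weaker performance on Chinese traditional culture images that the summary describes.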