A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models

Xiujie Song, Mengyue Wu, Kenny Q. Zhu, Chunhao Zhang, Yanyi Chen
2024-06-14
Abstract: Large Vision-Language Models (LVLMs), despite their recent success, are hardly comprehensively tested for their cognitive abilities. Inspired by the prevalent use of the "Cookie Theft" task in human cognition tests, we propose a novel evaluation benchmark to evaluate high-level cognitive ability of LVLMs using images with rich semantics. It defines eight reasoning capabilities and consists of an image description task and a visual question answering task. Our evaluation on well-known LVLMs shows that there is still a large gap in cognitive ability between LVLMs and humans.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to evaluate the advanced cognitive abilities of large vision-language models (LVLMs). Despite the significant success these models have achieved recently, their cognitive abilities have not been fully tested. Inspired by the "Cookie Theft" task commonly used in human cognitive testing, the authors propose a new evaluation benchmark to assess the advanced cognitive abilities of LVLMs when handling semantically rich images. Specifically, the paper proposes the following points:

1. **Construction of the Evaluation Benchmark**: Defines eight reasoning abilities and designs image description tasks and visual question answering tasks to comprehensively evaluate the advanced cognitive abilities of LVLMs.
2. **Comparison with Human Cognitive Testing**: By drawing on the design principles of the "Cookie Theft" task, ensures that the evaluation benchmark can effectively reflect the cognitive level of the models.
3. **Evaluation of Existing Models**: Evaluates well-known existing LVLMs, showing that these models still have a significant gap compared to human cognitive abilities.

### Main Contributions

1. **First Application of Human Cognitive Testing to LVLMs Evaluation**: This is the first attempt to incorporate cognitive tests designed for humans into the evaluation of LVLMs.
2. **Creation and Open-Sourcing of the Largest "Cookie Theft"-Like Image Dataset**: These images are used to evaluate the cognitive abilities of LVLMs.
3. **Revealing the Gap Between LVLMs and Human Cognitive Abilities**: Evaluation results indicate that LVLMs still need further improvement in cognitive abilities, providing a valuable benchmark for future research.

### Methods and Experiments

1. **Dataset Construction**:
   - **Image Collection**: Manually collected images from platforms like Pinterest based on specific criteria, ensuring the images contain interesting stories, rich causal chains, and appropriate content complexity.
   - **Image Annotation**: Hired annotators to label the images, including entities, causal chains, and descriptions.
2. **Task Design**:
   - **Image Description Task**: Requires models to understand and describe the stories in the images through advanced cognitive reasoning.
   - **Visual Question Answering Task**: Designed multiple-choice questions covering different types of advanced cognitive reasoning.
3. **Evaluation Strategy**:
   - **Description Task Evaluation**: Evaluates model performance from both low-level recognition ability and high-level cognitive ability, calculating recognition scores and cognitive scores respectively.
   - **Visual Question Answering Task Evaluation**: Uses accuracy as the evaluation metric to assess model performance on different types of reasoning tasks.

### Experimental Results

1. **Description Task**:
   - **Recognition Ability**: GPT-4o performed best in recognition ability, identifying more entities.
   - **Cognitive Ability**: GPT-4o also performed best in cognitive ability, but all open-source models showed very low performance in certain reasoning types, such as event reasoning, event relationship reasoning, and next moment event reasoning, indicating their limited understanding of image stories.
2. **Visual Question Answering Task**:
   - GPT-4o performed best in the visual question answering task but still had a significant gap compared to human levels. Open-source models performed better in positional reasoning but poorly in event-related reasoning.

### Conclusion

By constructing a new evaluation benchmark, this paper reveals the current deficiencies of LVLMs in advanced cognitive abilities, providing important references for future model development and improvement.
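
### Illustrative Sketch: Accuracy per Reasoning Type

The summary states that the visual question answering task is scored with accuracy, broken down by reasoning type. The snippet below is a minimal sketch of how such a breakdown could be computed; it is not the authors' evaluation code. The record layout (`reasoning_type`, `predicted`, `answer`), the function name, the type labels, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_reasoning_type(records):
    """Return (per-type accuracy dict, overall accuracy).

    `records` is an iterable of dicts with the hypothetical keys
    'reasoning_type', 'predicted', and 'answer' (multiple-choice letters).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        rtype = record["reasoning_type"]
        total[rtype] += 1
        # Compare choice letters case-insensitively, ignoring stray spaces.
        if record["predicted"].strip().upper() == record["answer"].strip().upper():
            correct[rtype] += 1
    per_type = {rtype: correct[rtype] / total[rtype] for rtype in total}
    n_questions = sum(total.values())
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return per_type, overall


# Made-up records, not data from the benchmark.
sample_records = [
    {"reasoning_type": "event", "predicted": "A", "answer": "B"},
    {"reasoning_type": "event", "predicted": "c", "answer": "C"},
    {"reasoning_type": "positional", "predicted": "D", "answer": "D"},
]

per_type, overall = accuracy_by_reasoning_type(sample_records)
print(per_type)
print(round(overall, 3))
```

Reporting accuracy per reasoning type rather than only overall is what makes observations like "better in positional reasoning but poorly in event-related reasoning" visible.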