What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to evaluate the advanced cognitive abilities of large vision-language models (LVLMs). Despite the significant success these models have achieved recently, their cognitive abilities have not been fully tested. Inspired by the "cookie theft" task commonly used in human cognitive testing, the authors propose a new evaluation benchmark to assess the advanced cognitive abilities of LVLMs when handling semantically rich images. Specifically, the paper proposes the following points:

1. **Construction of the Evaluation Benchmark**: Defines eight reasoning abilities and designs image description tasks and visual question answering tasks to comprehensively evaluate the advanced cognitive abilities of LVLMs.
2. **Comparison with Human Cognitive Testing**: By drawing on the design principles of the "cookie theft" task, ensures that the evaluation benchmark can effectively reflect the cognitive level of the models.
3. **Evaluation of Existing Models**: Evaluates well-known existing LVLMs, showing that these models still have a significant gap compared to human cognitive abilities.

### Main Contributions

1. **First Application of Human Cognitive Testing to LVLMs Evaluation**: This is the first attempt to incorporate cognitive tests designed for humans into the evaluation of LVLMs.
2. **Creation and Open-Sourcing of the Largest "Cookie Theft"-Like Image Dataset**: These images are used to evaluate the cognitive abilities of LVLMs.
3. **Revealing the Gap Between LVLMs and Human Cognitive Abilities**: Evaluation results indicate that LVLMs still need further improvement in cognitive abilities, providing a valuable benchmark for future research.

### Methods and Experiments

1. **Dataset Construction**:
   - **Image Collection**: Manually collected images from platforms like Pinterest based on specific criteria, ensuring the images contain interesting stories, rich causal chains, and appropriate content complexity.
   - **Image Annotation**: Hired annotators to label the images, including entities, causal chains, and descriptions.
2. **Task Design**:
   - **Image Description Task**: Requires models to understand and describe the stories in the images through advanced cognitive reasoning.
   - **Visual Question Answering Task**: Designed multiple-choice questions covering different types of advanced cognitive reasoning.
3. **Evaluation Strategy**:
   - **Description Task Evaluation**: Evaluates model performance from both low-level recognition ability and high-level cognitive ability, calculating recognition scores and cognitive scores respectively.
   - **Visual Question Answering Task Evaluation**: Uses accuracy as the evaluation metric to assess model performance on different types of reasoning tasks.

### Experimental Results

1. **Description Task**:
   - **Recognition Ability**: GPT-4o performed best in recognition ability, identifying more entities.
   - **Cognitive Ability**: GPT-4o also performed best in cognitive ability, but all open-source models showed very low performance in certain reasoning types, such as event reasoning, event relationship reasoning, and next moment event reasoning, indicating their limited understanding of image stories.
2. **Visual Question Answering Task**:
   - GPT-4o performed best in the visual question answering task but still had a significant gap compared to human levels. Open-source models performed better in positional reasoning but poorly in event-related reasoning.

### Conclusion

By constructing a new evaluation benchmark, this paper reveals the current deficiencies of LVLMs in advanced cognitive abilities, providing important references for future model development and improvement.
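### Illustrative Sketch of the Evaluation Metrics

The evaluation strategy summarized above (accuracy per reasoning type for the multiple-choice questions, and a recognition score for descriptions) could be sketched roughly as follows. This is a minimal illustration, not the authors' code. The function names, the record layout, and the reasoning-type labels are hypothetical, and treating the recognition score as the fraction of annotated entities found by case-insensitive substring matching is an assumption made here for illustration; the paper's actual scoring procedure may differ.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute multiple-choice accuracy separately for each reasoning type.

    `records` is an iterable of (reasoning_type, predicted_option, correct_option)
    tuples. This layout is hypothetical, chosen only for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for reasoning_type, predicted, gold in records:
        total[reasoning_type] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == gold.strip().upper():
            correct[reasoning_type] += 1
    return {t: correct[t] / total[t] for t in total}


def entity_recall(description, annotated_entities):
    """Assumed recognition score: share of annotated entities named in the text.

    Uses naive case-insensitive substring matching, which is an assumption of
    this sketch rather than the benchmark's documented method.
    """
    if not annotated_entities:
        return 0.0
    text = description.lower()
    hits = sum(1 for entity in annotated_entities if entity.lower() in text)
    return hits / len(annotated_entities)


if __name__ == "__main__":
    # Made-up records, not results from the paper.
    sample = [
        ("event", "A", "B"),
        ("event", "C", "C"),
        ("position", "D", "D"),
    ]
    print(accuracy_by_type(sample))
    print(entity_recall(
        "A boy steals a cookie while the sink overflows.",
        ["boy", "cookie", "mother"],
    ))
```

Reporting accuracy per reasoning type rather than as a single number is what makes contrasts such as "better in positional reasoning but poorly in event-related reasoning" visible.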

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Enhancing Advanced Visual Reasoning Ability of Large Language Models

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition

Effectiveness Assessment of Recent Large Vision-Language Models

COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?