Abstract: We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs' proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. Thus we evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and data are available at https://github.com/Yijia-Xiao/LogicVista.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper aims to address the lack of systematic evaluation of multimodal large language models (MLLMs) on logical reasoning tasks. Although current MLLMs perform well on tasks such as writing poetry from an image and mathematical reasoning, their logical reasoning capabilities have not been systematically assessed. These capabilities are crucial for activities such as navigation and puzzle-solving.
Specifically, the paper proposes **LogicVista**, a benchmark for evaluating the integrated logical reasoning abilities of multimodal large language models in visual contexts. LogicVista spans five categories of logical reasoning tasks covering nine different capabilities, and consists of 448 multiple-choice questions. Each question is annotated with the correct answer and a human-written reasoning process, supporting both open-ended and multiple-choice evaluation.
### Main Contributions
1. **Comprehensive Coverage of Logical Reasoning Tasks**: LogicVista covers five main logical reasoning tasks: inductive reasoning, deductive reasoning, numerical reasoning, spatial reasoning, and mechanical reasoning.
2. **Diverse Multimodal Abilities**: It evaluates nine different multimodal abilities: charts, OCR, patterns, graphics, tables, 3D shapes, puzzles, sequences, and physics.
3. **Detailed Human Annotations**: All images, instructions, answers, and reasoning processes are manually annotated and verified to ensure data accuracy and completeness.
4. **Flexible Evaluation Methods**: It provides multiple-choice and open-ended evaluation strategies, enabling detailed quantitative analysis of different models' reasoning abilities and techniques.
### Related Work
The paper compares LogicVista with existing vision-language benchmarks such as VQAv2, TextCaps, MM-Vet, and MathVista, and points out their limitations in evaluating logical reasoning abilities. LogicVista complements these benchmarks with a more comprehensive and systematic evaluation framework for logical reasoning.
### Data Collection and Annotation
To ensure the quality of the evaluation and to avoid data leakage, the paper follows a strict data collection and annotation process. Data sources include proprietary resources that require permission, registration, or payment, and are therefore less likely to have appeared in models' training data. Each question is manually annotated with the correct answer and a detailed reasoning process. Data is stored in JSON format for easy retrieval and processing in the evaluation pipeline.
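As a rough illustration of how JSON-stored questions could be loaded and sanity-checked, here is a minimal Python sketch. The record layout and every field name (`id`, `image`, `question`, `choices`, `answer`, `reasoning`, `skill`, `capabilities`) are assumptions made for this example, not the benchmark's actual schema, and the two sample questions are invented placeholders.

```python
import json
from collections import Counter

# Hypothetical records: the field names and contents are illustrative
# assumptions, not taken from the LogicVista repository.
SAMPLE = """
[
  {"id": "q1", "image": "images/q1.png",
   "question": "Which figure completes the sequence?",
   "choices": ["A", "B", "C", "D"], "answer": "B",
   "reasoning": "Each step rotates the shape by 90 degrees.",
   "skill": "inductive", "capabilities": ["patterns", "sequences"]},
  {"id": "q2", "image": "images/q2.png",
   "question": "Which gear turns clockwise?",
   "choices": ["A", "B", "C"], "answer": "C",
   "reasoning": "Adjacent meshed gears rotate in opposite directions.",
   "skill": "mechanical", "capabilities": ["physics"]}
]
"""


def load_records(text):
    """Parse a JSON array of questions and check each answer is a listed choice."""
    records = json.loads(text)
    for record in records:
        if record["answer"] not in record["choices"]:
            raise ValueError(f"answer not among choices for {record['id']}")
    return records


def count_by_skill(records):
    """Count how many questions fall under each reasoning skill."""
    return Counter(record["skill"] for record in records)


records = load_records(SAMPLE)
skill_counts = count_by_skill(records)
```

Keeping the answer and the written reasoning in the same record is what would allow one file to serve both multiple-choice scoring and open-ended comparison.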
### Evaluation Setup
The paper selects 8 representative multimodal large language models for evaluation, including LLaVA, MiniGPT-4, Otter, GPT-4 Vision, BLIP-2, and InstructBLIP. Each model generates outputs on the LogicVista dataset, and an LLM-based answer extractor then pulls the selected option out of each free-form output so that it can be compared with the ground-truth answer.
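The extract-then-compare step can be sketched in Python as follows. Note that the paper uses an LLM-based extractor; the regex-based `extract_choice` below is a much simpler stand-in chosen so the example is self-contained, and both function names, as well as the assumption that options are labeled A to E, are hypothetical.

```python
import re

# Explicit phrasing such as "Answer: B" or "the answer is (B)".
EXPLICIT = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-E])\)?(?![A-Za-z])")
# A parenthesized option letter anywhere in the text, e.g. "I pick (C)".
PARENTHESIZED = re.compile(r"\(([A-E])\)")
# The whole output is just a letter, optionally followed by "." or ")".
BARE = re.compile(r"\s*([A-E])[.)]?\s*")


def extract_choice(output):
    """Return the option letter found in a model output, or None if none is found."""
    for pattern in (EXPLICIT, PARENTHESIZED):
        match = pattern.search(output)
        if match:
            return match.group(1)
    match = BARE.fullmatch(output)
    return match.group(1) if match else None


def accuracy(outputs, gold):
    """Fraction of outputs whose extracted letter equals the gold answer."""
    if len(outputs) != len(gold) or not gold:
        raise ValueError("outputs and gold must be non-empty and of equal length")
    correct = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)
```

Outputs with no recoverable letter count as wrong here. A pattern-based extractor like this one will misread some phrasings (for example, "the answer is A triangle"), which is one plausible motivation for using an LLM to do the extraction instead.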
### Performance Analysis
Evaluation results show that many models perform below expectations on logical reasoning tasks, in some cases worse than random guessing. This may be because most multimodal LLMs are trained largely on traditional computer vision datasets such as COCO, which emphasize recognition rather than complex reasoning. Models perform relatively better on deductive, numerical, and mechanical reasoning, possibly because these types of reasoning are more common in real life.
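To make "worse than random guessing" concrete: for a question with k options, uniform guessing is right with probability 1/k, so the chance baseline over a set of questions is the mean of 1/k. The Python sketch below computes that baseline and a per-skill accuracy breakdown; the function names and input shapes are assumptions for illustration and are not taken from the paper's code.

```python
from collections import defaultdict


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing, given options per question."""
    if not option_counts:
        raise ValueError("need at least one question")
    return sum(1.0 / k for k in option_counts) / len(option_counts)


def accuracy_by_skill(results):
    """Map each skill to its accuracy, given (skill, is_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # skill -> [correct, total]
    for skill, is_correct in results:
        totals[skill][0] += int(is_correct)
        totals[skill][1] += 1
    return {skill: correct / total for skill, (correct, total) in totals.items()}
```

A model whose accuracy on a skill falls below `chance_accuracy` for that skill's questions is doing no better than picking options at random.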
### Conclusion
Logical reasoning ability is fundamental to solving complex tasks, but research in this area is still limited for multimodal LLMs. LogicVista provides a comprehensive evaluation framework that can help researchers better understand these models' performance in logical reasoning tasks, thereby promoting further development of multimodal LLMs in critical thinking and complex tasks.