Abstract: We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs' proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. Thus we evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and data are available at https://github.com/Yijia-Xiao/LogicVista.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the lack of systematic evaluation of multimodal large language models (MLLMs) on logical reasoning tasks. Although current MLLMs perform well on tasks such as writing poetry from an image and mathematical reasoning, their capabilities in logical reasoning have not been fully assessed. These capabilities are crucial for activities such as navigation and puzzle-solving.

Specifically, the paper proposes **LogicVista**, a benchmark for evaluating the integrated logical reasoning abilities of MLLMs in visual contexts. LogicVista spans five categories of logical reasoning and nine different capabilities, and consists of 448 multiple-choice questions. Each question is annotated with the correct answer and a human-written reasoning process, supporting both open-ended and multiple-choice evaluation.

### Main Contributions

1. **Comprehensive Coverage of Logical Reasoning Tasks**: LogicVista covers five logical reasoning tasks: inductive, deductive, numerical, spatial, and mechanical reasoning.
2. **Diverse Multimodal Abilities**: It evaluates nine different multimodal abilities: charts, OCR, patterns, graphics, tables, 3D shapes, puzzles, sequences, and physics.
3. **Detailed Human Annotations**: All images, instructions, answers, and reasoning processes are manually annotated and verified to ensure data accuracy and completeness.
4. **Flexible Evaluation Methods**: It provides multiple-choice and open-ended evaluation strategies, enabling detailed quantitative analysis of different models' reasoning abilities and techniques.

### Related Work

The paper compares existing vision-language benchmarks such as VQAv2, TextCaps, MM-vet, and MathVista, pointing out their deficiencies in evaluating logical reasoning abilities. LogicVista complements these by providing a more comprehensive and systematic evaluation framework.

### Data Collection and Annotation

To ensure the completeness and quality of the evaluation, the paper adopts a strict data collection and annotation process designed to avoid data leakage. Data sources include proprietary resources that require permission, registration, or payment. Each question is manually annotated with the correct answer and a detailed reasoning process. Data is stored in JSON format for easy retrieval and processing in the evaluation pipeline.

### Evaluation Setup

The paper selects 8 representative MLLMs for evaluation, including LLaVA, MiniGPT-4, Otter, GPT-4 Vision, BLIP-2, and InstructBLIP. Each model generates outputs on the LogicVista dataset, and an LLM-based multiple-choice answer extractor pulls the selected option out of each output and compares it with the reference answer.

### Performance Analysis

Evaluation results show that many models perform below expectations on logical reasoning tasks, in some cases worse than random guessing. This may be because most of the training data for multimodal LLMs comes from traditional computer vision datasets such as COCO, which focus on recognition rather than complex reasoning. However, the models perform relatively well on deductive, numerical, and mechanical reasoning tasks, possibly because these types of reasoning are more common in real life.

### Conclusion

Logical reasoning ability is fundamental to solving complex tasks, but research in this area is still limited for multimodal LLMs. LogicVista provides a comprehensive evaluation framework that can help researchers better understand how these models perform on logical reasoning tasks, thereby promoting further development of multimodal LLMs in critical thinking and complex tasks.
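
The evaluation flow summarized above (questions stored as JSON, an answer extracted from each free-form model output, comparison against the reference answer, and a random-guessing baseline) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names (`id`, `skill`, `choices`, `answer`), the sample records, and the model outputs are hypothetical and do not reflect the schema or data of the LogicVista release. The paper uses an LLM-based extractor; the regular expression below is only a simple stand-in for it.

```python
import json
import re
from collections import defaultdict

# Hypothetical records: field names and values are illustrative assumptions,
# not the actual LogicVista schema.
DATASET_JSON = """
[
  {"id": "q1", "skill": "deductive", "choices": ["A", "B", "C", "D"], "answer": "B"},
  {"id": "q2", "skill": "deductive", "choices": ["A", "B", "C", "D"], "answer": "D"},
  {"id": "q3", "skill": "spatial", "choices": ["A", "B", "C", "D"], "answer": "A"}
]
"""

# Made-up free-form model outputs, keyed by question id.
MODEL_OUTPUTS = {
    "q1": "The pattern repeats every two steps, so the answer is B.",
    "q2": "Answer: (C)",
    "q3": "I cannot determine the answer from the image.",
}


def extract_choice(text, choices):
    """Regex stand-in for an LLM-based extractor.

    Returns the last standalone option letter found in the text, or None
    if the output contains no option letter.
    """
    pattern = r"\b([" + "".join(choices) + r"])\b"
    matches = re.findall(pattern, text)
    return matches[-1] if matches else None


def score(records, outputs):
    """Compute accuracy per skill; a missing or unparseable answer counts as wrong."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        pred = extract_choice(outputs.get(rec["id"], ""), rec["choices"])
        total[rec["skill"]] += 1
        correct[rec["skill"]] += int(pred == rec["answer"])
    return {skill: correct[skill] / total[skill] for skill in total}


records = json.loads(DATASET_JSON)
per_skill = score(records, MODEL_OUTPUTS)

# Expected accuracy of uniform random guessing, averaged over questions.
chance = sum(1 / len(r["choices"]) for r in records) / len(records)

for skill, acc in per_skill.items():
    flag = "below chance" if acc < chance else "at or above chance"
    print(f"{skill}: {acc:.2f} ({flag}, chance = {chance:.2f})")
```

Reporting each skill next to the chance level makes a "worse than random guessing" result visible directly. A regex extractor like this one can misread prose (for example, a sentence beginning with the article "A"), which is one plausible reason to prefer an LLM-based extractor for free-form outputs.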

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
