Abstract: Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly to build. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while still ensuring the generation of a high-quality dataset. Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage applies supervised fine-tuning to VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency.
