NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth

2024-07-15

Abstract: Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some common sense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises 2,728 multiple-choice questions with a total of 4,642 images across 26 categories, sampled from the NTSE examination conducted nationwide in India, featuring both visual and textual general aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle different modalities (text and images) in the dataset instances.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Information Retrieval

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Assessing Complex Cognitive Reasoning Ability**: Propose a new benchmark dataset, NTSEBench, to evaluate the performance of large language models (LLMs) and vision-language models (VLMs) on complex textual, visual, and multimodal cognitive reasoning tasks.
2. **Filling Research Gaps**: Existing datasets mainly focus on reasoning tasks in specific domains, whereas NTSEBench tests cognitive reasoning abilities that are not tied to a specific domain and do not rely on rote memorization.
3. **Model Performance Comparison**: Compare the performance of open-source and proprietary models on textual and multimodal reasoning tasks under different modeling strategies and prompting methods, and analyze how these strategies affect model accuracy.
4. **Enhancing Model Capabilities**: Use the experimental results to reveal the limitations of current LLMs and VLMs in handling complex cognitive reasoning tasks, and provide guidance for future research directions.

VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models

GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

ReMI: A Dataset for Reasoning with Multiple Images

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

SAT: Spatial Aptitude Training for Multimodal Language Models

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

NL-Eye: Abductive NLI for Images

How Far Are We from Intelligent Visual Deductive Reasoning?

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis