NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

Pranshu Pandya, Agney S Talwarr, Vatsal Gupta, Tushar Kataria, Vivek Gupta, Dan Roth
2024-07-15
Abstract: Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some common sense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises 2,728 multiple-choice questions with a total of 4,642 images across 26 categories, sampled from the NTSE examination conducted nationwide in India, featuring both visual and textual general aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open-source and proprietary models, we propose four distinct modeling strategies to handle different modalities (text and images) in the dataset instances.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Information Retrieval
What problem does this paper attempt to address?
The paper aims to address the following issues:
1. **Assessing Complex Cognitive Reasoning Ability**: Propose a new benchmark dataset, NTSEBench, to evaluate the performance of large language models (LLMs) and vision language models (VLMs) on complex textual, visual, and multimodal cognitive reasoning tasks.
2. **Filling Research Gaps**: Existing datasets mainly focus on reasoning tasks in specific domains, whereas NTSEBench tests domain-general cognitive reasoning that does not rely on rote memorization of knowledge.
3. **Model Performance Comparison**: Compare open-source and proprietary models on text and multimodal reasoning tasks under different modeling strategies and prompting methods, and analyze how those strategies affect model accuracy.
4. **Enhancing Model Capabilities**: Use the experimental results to reveal the limitations of current LLMs and VLMs on complex cognitive reasoning tasks, and provide guidance for future research directions.