Abstract: With the rapid development of Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing Earth observation have demonstrated exceptional perception and reasoning abilities within this field. However, no comprehensive benchmark currently exists for holistically evaluating the remote sensing capabilities of these VLMs, which represents a significant gap. To bridge this gap, we propose COREval, the first benchmark designed to comprehensively and objectively evaluate the hierarchical remote sensing capabilities of VLMs. Concentrating on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further categorize 6 secondary dimensions and 22 leaf tasks to ensure well-rounded assessment coverage for this field. COREval guarantees the quality of its 6,263 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control, and the multiple-choice format with definitive answers allows for an objective and straightforward evaluation of VLM performance. We conducted a holistic evaluation of 13 prominent open-source VLMs from both the general and remote sensing domains, highlighting current shortcomings in their remote sensing capabilities and providing directions for improvements in their application within this specialized context. We hope that COREval will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing.

What problem does this paper attempt to address?

The paper addresses the lack of a comprehensive and objective benchmark for evaluating the capabilities of large vision-language models (VLMs) in the remote sensing field. Although both general-domain VLMs and those specifically designed for remote sensing Earth observation have demonstrated strong perception and reasoning capabilities, there is currently no systematic benchmark for evaluating their performance across remote sensing tasks.

### Main contributions of the paper

1. **Proposing the COREval benchmark**: COREval is the first benchmark specifically designed to comprehensively and objectively evaluate the perception and reasoning capabilities of large vision-language models in the remote sensing field.
2. **Constructing a hierarchical capability classification system**: Perception and reasoning are treated as first-level capabilities (L-1), which are further subdivided into 6 second-level dimensions (L-2) and 22 third-level tasks (L-3), ensuring wide evaluation coverage.
3. **Data collection and question construction**: 6,263 questions were manually collected from multi-source satellites, platforms, and products, covering 50 globally distributed cities. Public datasets were avoided to ensure the objectivity of the evaluation.
4. **Multiple question-construction methods**: Three methods (label-driven, base-model-driven, and human-machine collaboration) were used to construct questions, ensuring their quality and diversity.
5. **Quality control**: A multi-stage quality-control process was applied to ensure the accuracy and reliability of all questions.

### Main findings

The evaluation of 13 mainstream open-source VLMs revealed three key findings:

1. **Good basic remote sensing capabilities**: Both general-domain VLMs and remote sensing VLMs (RSVLMs) perform well on image-level perception.
2. **Weak fine-grained instance-perception capabilities**: Almost all VLMs struggle with fine-grained object perception and with reasoning about relationships between instances.
3. **Limited advanced remote sensing reasoning capabilities**: Existing VLMs perform poorly on advanced reasoning tasks involving complex remote sensing scenes, social attributes, and specific remote sensing features.

Through COREval, the researchers hope to provide a valuable resource for VLMs in the remote sensing field and to offer in-depth insights for future research and development.
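Because every COREval question is multiple-choice with a single definitive answer, scoring can reduce to matching option letters, and the three-level taxonomy (L-1, L-2, L-3) suggests reporting accuracy at each level. The following is a minimal Python sketch under stated assumptions, not the authors' evaluation code: the record fields (`l1`, `l2`, `l3`, `answer`, `prediction`), the function name `hierarchical_accuracy`, and the dimension and task labels are hypothetical, and per-question (micro) averaging is an assumption, since the paper may aggregate scores differently.

```python
from collections import defaultdict


def hierarchical_accuracy(records):
    """Compute accuracy per L-1, L-2, and L-3 label.

    Each record is a dict with hypothetical keys:
    'l1', 'l2', 'l3' (taxonomy labels), 'answer' (gold option letter),
    and 'prediction' (the model's chosen option letter).
    """
    # counts[level][label] = [number correct, number seen]
    counts = {level: defaultdict(lambda: [0, 0]) for level in ("l1", "l2", "l3")}
    for rec in records:
        # Compare option letters, ignoring case and surrounding whitespace.
        hit = int(rec["prediction"].strip().upper() == rec["answer"].strip().upper())
        for level in counts:
            tally = counts[level][rec[level]]
            tally[0] += hit
            tally[1] += 1
    # Micro-average: every question carries equal weight within its label.
    return {
        level: {label: correct / total for label, (correct, total) in labels.items()}
        for level, labels in counts.items()
    }


# Placeholder records; the labels below are illustrative, not COREval's task names.
records = [
    {"l1": "perception", "l2": "dim_a", "l3": "task_1", "answer": "A", "prediction": "A"},
    {"l1": "perception", "l2": "dim_a", "l3": "task_1", "answer": "B", "prediction": "C"},
    {"l1": "perception", "l2": "dim_b", "l3": "task_2", "answer": "D", "prediction": " d"},
    {"l1": "reasoning", "l2": "dim_c", "l3": "task_3", "answer": "C", "prediction": "A"},
]
scores = hierarchical_accuracy(records)
```

Reporting all three levels side by side makes it easier to see patterns such as the one in the findings above, where image-level perception scores can be high while fine-grained or reasoning tasks lag.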

COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

MMBench: Is Your Multi-modal Model an All-around Player?

See, Perceive, and Answer: A Unified Benchmark for High-Resolution Postdisaster Evaluation in Remote Sensing Images

Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data

VHELM: A Holistic Evaluation of Vision Language Models

RSGPT: A Remote Sensing Vision Language Model and Benchmark

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models