COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models

Xiao An, Jiaxing Sun, Zihan Gui, Wei He
2024-11-27
Abstract: With the rapid development of Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing Earth observation have demonstrated exceptional perception and reasoning abilities within this specific field. However, the current absence of a comprehensive benchmark for holistically evaluating the remote sensing capabilities of these VLMs represents a significant gap. To bridge this gap, we propose COREval, the first benchmark designed to comprehensively and objectively evaluate the hierarchical remote sensing capabilities of VLMs. Concentrating on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further categorize 6 secondary dimensions and 22 leaf tasks to ensure well-rounded assessment coverage for this specific field. COREval guarantees the quality of its 6,263 problems through a rigorous process of data collection from 50 globally distributed cities, question construction, and quality control. The multiple-choice format with definitive answers allows for an objective and straightforward evaluation of VLM performance. We conducted a holistic evaluation of 13 prominent open-source VLMs from both the general and remote sensing domains, highlighting current shortcomings in their remote sensing capabilities and providing directions for improvements in their application within this specialized context. We hope that COREval will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the lack of a comprehensive, objective benchmark for evaluating the capabilities of large vision-language models (VLMs) in the remote sensing field. Although both general-domain VLMs and those designed specifically for remote sensing Earth observation have demonstrated excellent perception and reasoning capabilities in this field, no systematic benchmark yet exists for measuring their performance across remote sensing tasks.

### Main contributions of the paper

1. **Proposing the COREval benchmark**: COREval is the first benchmark specifically designed to comprehensively and objectively evaluate the perception and reasoning capabilities of large vision-language models in the remote sensing field.
2. **Constructing a hierarchical capability classification system**: Perception and reasoning are treated as first-level capabilities (L-1), which are further subdivided into 6 second-level dimensions (L-2) and 22 third-level tasks (L-3), ensuring wide evaluation coverage.
3. **Data collection and question construction**: 6,263 questions were manually collected from multi-source satellites, platforms and products, covering 50 globally distributed cities. Public datasets were avoided to ensure the objectivity of the evaluation.
4. **Multiple question-construction methods**: Three methods, namely label-driven, base-model-driven and human-machine collaboration, were used to construct questions, ensuring their quality and diversity.
5. **Quality control**: A multi-stage quality-control process was applied to ensure the accuracy and reliability of all questions.

### Main findings

The evaluation of 13 mainstream open-source VLMs yielded three key findings:

1. **Good basic remote sensing capabilities**: Both general-domain VLMs and remote sensing VLMs (RSVLMs) perform well on image-level perception.
2. **Weak fine-grained instance-perception capabilities**: Almost all VLMs struggle with fine-grained object perception and with reasoning about relationships between instances.
3. **Limited advanced remote sensing reasoning capabilities**: Existing VLMs perform poorly on advanced reasoning tasks involving complex remote sensing scenes, social attributes and specific remote sensing features.

Through COREval, the researchers hope to provide a valuable resource for VLMs in the remote sensing field and to offer in-depth insights for future research and development.
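Because every COREval question is multiple-choice with a single definitive answer, scoring reduces to exact-match accuracy that can be rolled up along the L-1 / L-2 / L-3 hierarchy. The following is a minimal sketch of such a roll-up, not the authors' evaluation code: the record layout, the function name `hierarchical_accuracy`, the dimension and task labels, and the choice of per-question (micro) averaging are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (L-1 capability, L-2 dimension, L-3 task, gold, predicted).
# Only "perception" and "reasoning" come from the paper; the L-2 and L-3
# labels below are placeholders, not the benchmark's actual names.
SAMPLE = [
    ("perception", "p-dim-1", "task-1", "A", "A"),
    ("perception", "p-dim-1", "task-1", "B", "C"),
    ("perception", "p-dim-2", "task-2", "D", "d"),
    ("reasoning", "r-dim-1", "task-3", "C", "C"),
]


def hierarchical_accuracy(records):
    """Return exact-match accuracy keyed by (level, name).

    Each question counts once at every level it belongs to, so the
    result is a per-question (micro) average. This is an assumption;
    a benchmark could also average over tasks instead.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for l1, l2, l3, gold, pred in records:
        # Compare option letters case-insensitively, ignoring whitespace.
        hit = int(pred.strip().upper() == gold.strip().upper())
        for key in (("overall", "all"), ("L-1", l1), ("L-2", l2), ("L-3", l3)):
            correct[key] += hit
            total[key] += 1
    return {key: correct[key] / total[key] for key in total}


scores = hierarchical_accuracy(SAMPLE)
for (level, name), acc in sorted(scores.items()):
    print(f"{level:8s} {name:12s} {acc:.2f}")
```

Reporting at all three levels is what would make findings like those above visible: a model can score well on an L-1 capability overall while a single L-3 task underneath it stays near chance.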