Abstract: While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they fall short in addressing the unique demands of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, which is critical for applications such as environmental monitoring, urban planning, and disaster management. Some of the unique challenges in the geospatial domain include temporal analysis of changes, counting objects in large quantities, detecting tiny objects, and understanding relationships between entities in remote sensing imagery. To address this gap, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, and temporal analysis. Our benchmark features over 10,000 manually verified instructions and covers a diverse set of variations in visual conditions, object type, and scale. We evaluate several state-of-the-art VLMs to assess their accuracy within the geospatial context. The results indicate that although existing VLMs demonstrate potential, they face challenges when dealing with geospatial-specific examples, highlighting the room for further improvement. Specifically, the best-performing GPT4o achieves only 40% accuracy on MCQs, which is only double the random-guess performance. Our benchmark is publicly available at https://github.com/The-AI-Alliance/GEO-Bench-VLM.

What problem does this paper attempt to address?

This paper addresses the challenges that existing vision-language models (VLMs) face when processing geospatial data. General-purpose VLM benchmarks evaluate models on general visual understanding tasks, but they are not suited to the complexity specific to geospatial applications such as environmental monitoring, urban planning, and disaster management. These applications involve particular challenges: temporal analysis, counting large numbers of objects, small-object detection, and understanding the relationships between entities in remote-sensing images. To fill this gap, the authors propose GEOBench-VLM, a comprehensive benchmark suite specifically designed to evaluate VLMs on geospatial tasks. GEOBench-VLM covers 8 main categories and 31 subtasks, including scene understanding, object counting, visual localization, image captioning, temporal understanding, non-optical image processing, reference segmentation, and relational reasoning, aiming to comprehensively evaluate VLM performance in Earth-observation applications. Through this benchmark, the authors hope to reveal the capabilities and limitations of existing VLMs on geospatial data and to provide directions for future research and development. Specifically, the paper points out that although some state-of-the-art VLMs perform well on certain tasks, they still struggle with geospatial-specific tasks, indicating that there is room for improvement. For example, the best model, GPT4o, reaches an accuracy of only 40% on multiple-choice questions, just twice the performance of random guessing, which shows significant limitations on geospatial tasks.
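The "twice random guessing" comparison above can be made concrete with a small calculation. The sketch below is illustrative only: it assumes five answer options per multiple-choice question, which the text does not state but which is what a 40% score being double the chance level would imply. The function and constant names are hypothetical and are not taken from the benchmark's code.

```python
def random_guess_accuracy(num_options: int) -> float:
    """Expected accuracy when one of num_options is picked uniformly at random."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


def lift_over_chance(accuracy: float, num_options: int) -> float:
    """Ratio of an observed accuracy to the random-guess baseline."""
    return accuracy / random_guess_accuracy(num_options)


# Assumption (not stated in the text): five answer options per question.
ASSUMED_OPTIONS = 5
# MCQ accuracy quoted in the summary for the best-performing model.
REPORTED_ACCURACY = 0.40

baseline = random_guess_accuracy(ASSUMED_OPTIONS)
lift = lift_over_chance(REPORTED_ACCURACY, ASSUMED_OPTIONS)
print(f"chance baseline = {baseline:.0%}, lift over chance = {lift:.1f}x")
```

Under that assumption the chance baseline is 20%, so a 40% score is a 2x lift. With a different number of options the same reported accuracy would correspond to a different lift.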

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data

Deep Visual Geo-localization Benchmark

GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark

GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild

LocateBench: Evaluating the Locating Ability of Vision Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models

μ-Bench: A Vision-Language Benchmark for Microscopy Understanding

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology