Abstract: Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications of various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We keep tracing related works at https://github.com/zytx121/Awesome-VLGFM.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is the poor performance of current Vision-Language Foundation Models (VLFMs) on Earth observation tasks. Specifically, most traditional VLFMs are trained on general-purpose image datasets and lack support for geospatial data, which results in poor performance on Earth observation tasks. To overcome this limitation, researchers have proposed Vision-Language Geo-Foundation Models (VLGFMs). These models aim to use large-scale multimodal geospatial data to build intelligent models with diverse geographical perception capabilities. The goal of VLGFMs is to improve the model's generalization and reasoning abilities on Earth observation tasks by integrating image and text information, so as to better support tasks such as remote sensing image classification, object detection, change detection, denoising, land-use segmentation, disaster management, and geolocation.

### Main problem summary:

1. **Limitations of traditional VLFMs**: Most existing VLFMs are trained on general-purpose image datasets and lack support for geospatial data, which results in poor performance on Earth observation tasks.
2. **Importance of geospatial data**: Geospatial data is crucial for Earth observation tasks, so models need to be specifically designed and trained for this type of data.
3. **Need for multimodal fusion**: To improve the model's generalization and reasoning abilities, visual and linguistic information must be combined to build models capable of handling complex Earth observation tasks.

### Solutions:

- **Introduction of VLGFMs**: Use large-scale multimodal geospatial data to build intelligent models with diverse geographical perception capabilities.
- **Data collection and annotation**: Collect high-quality geospatial image-text pair datasets and annotate them at a fine-grained level to support model training and optimization.
- **Model architecture innovation**: Explore new model architectures and training methods to improve model performance on Earth observation tasks.

Through these efforts, VLGFMs are expected to achieve better performance and broader application in the field of Earth observation.

Towards Vision-Language Geo-Foundation Model: A Survey
