Towards Vision-Language Geo-Foundation Model: A Survey

Yue Zhou, Litong Feng, Yiping Ke, Xue Jiang, Junchi Yan, Xue Yang, Wayne Zhang
2024-06-14
Abstract: Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding. However, most methods rely on training with general image datasets, and the lack of geospatial data leads to poor performance on earth observation. Numerous geospatial image-text pair datasets and VLFMs fine-tuned on them have been proposed recently. These new approaches aim to leverage large-scale, multimodal geospatial data to build versatile intelligent models with diverse geo-perceptive capabilities, which we refer to as Vision-Language Geo-Foundation Models (VLGFMs). This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field. In particular, we introduce the background and motivation behind the rise of VLGFMs, highlighting their unique research significance. Then, we systematically summarize the core technologies employed in VLGFMs, including data construction, model architectures, and applications of various multimodal geospatial tasks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To the best of our knowledge, this is the first comprehensive literature review of VLGFMs. We keep tracing related works at https://github.com/zytx121/Awesome-VLGFM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the poor performance of current Vision-Language Foundation Models (VLFMs) on Earth observation tasks. Most VLFMs are trained on general-purpose image datasets that contain little geospatial data, so they transfer poorly to Earth observation. To overcome this limitation, researchers have proposed Vision-Language Geo-Foundation Models (VLGFMs), which use large-scale multimodal geospatial data to build intelligent models with diverse geo-perceptive capabilities. By integrating image and text information, VLGFMs aim to improve generalization and reasoning in Earth observation and thereby better support tasks such as remote sensing image classification, object detection, change detection, denoising, land-use segmentation, disaster management, and geolocation.

### Main problem summary:

1. **Limitations of traditional VLFMs**: Most existing VLFMs are trained on general-purpose image datasets and lack support for geospatial data, which results in poor performance on Earth observation tasks.
2. **Importance of geospatial data**: Geospatial data is crucial for Earth observation tasks, so models need to be designed and trained specifically for this type of data.
3. **Need for multimodal fusion**: Improving generalization and reasoning requires combining visual and linguistic information to build models capable of handling complex Earth observation tasks.

### Solutions:

- **Introduction of VLGFMs**: Use large-scale multimodal geospatial data to build intelligent models with diverse geo-perceptive capabilities.
- **Data collection and annotation**: Collect high-quality geospatial image-text pair datasets and annotate them at a fine-grained level to support model training and optimization.
- **Model architecture innovation**: Explore new model architectures and training methods to improve performance on Earth observation tasks.

Through these efforts, VLGFMs are expected to achieve better performance and broader application in the field of Earth observation.