Abstract: The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4 (GPT-4) have sparked a wave of interest and research in large language models (LLMs) as a path toward artificial general intelligence (AGI). These models provide intelligent solutions that are closer to human thinking, enabling general artificial intelligence (AI) to be applied to problems across a wide range of applications. In the field of remote sensing (RS), however, the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in RS focuses primarily on visual-understanding tasks while neglecting the semantic understanding of objects and their relationships. This is where vision-language models (VLMs) excel: they reason jointly about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. VLMs can go beyond visual recognition of RS images to model semantic relationships and generate natural language descriptions of an image. This makes them better suited to tasks that require both visual and textual understanding, such as image captioning and visual question answering (VQA). This article provides a comprehensive review of the research on VLMs in RS, summarizing the latest progress, highlighting current challenges, and identifying potential research opportunities. Specifically, we review the application of VLMs in mainstream RS tasks, including image captioning, text-based image generation, text-based image retrieval (TBIR), VQA, scene classification, semantic segmentation, and object detection. For each task, we analyze representative works and discuss research progress. Finally, we summarize the limitations of existing works and outline possible directions for future development. This review aims to give an overview of the current research progress of VLMs in RS (see Figure 1) and to inspire further research in this promising field.

Vision-Language Models in Remote Sensing: Current progress and future trends