Abstract: The remarkable achievements of ChatGPT and Generative Pre-trained Transformer 4 (GPT-4) have sparked a wave of interest and research in large language models (LLMs) for artificial general intelligence (AGI). These models provide intelligent solutions that are closer to human thinking, enabling us to use general artificial intelligence (AI) to solve problems in various applications. However, in the field of remote sensing (RS), the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research in RS focuses primarily on visual-understanding tasks while neglecting the semantic understanding of objects and their relationships. This is where vision-language models (VLMs) excel: they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. VLMs can go beyond visual recognition of RS images to model semantic relationships and generate natural language descriptions of an image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning and visual question answering (VQA). This article provides a comprehensive review of the research on VLMs in RS, summarizing the latest progress, highlighting current challenges, and identifying potential research opportunities. Specifically, we review the application of VLMs in mainstream RS tasks, including image captioning, text-based image generation, text-based image retrieval (TBIR), VQA, scene classification, semantic segmentation, and object detection. For each task, we analyze representative works and discuss research progress. Finally, we summarize the limitations of existing works and provide possible directions for future development. This review aims to provide a comprehensive overview of the current research progress of VLMs in RS (see Figure 1) and to inspire further research in this exciting and promising field.

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Autoregressive Models in Vision: A Survey

A Survey on Vision Autoregressive Model

Towards the Unification of Generative and Discriminative Visual Foundation Model: A Survey

A Survey for Foundation Models in Autonomous Driving

Vision Language Models in Autonomous Driving: A Survey and Outlook

Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Data-efficient Large Vision Models through Sequential Autoregression

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Vision-Language Models for Vision Tasks: A Survey

A Survey of Resource-efficient LLM and Multimodal Foundation Models

Foundational Models Defining a New Era in Vision: A Survey and Outlook

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Towards Vision-Language Geo-Foundation Model: A Survey

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

Vision-Language Models in Remote Sensing: Current progress and future trends

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework