Abstract: Remote sensing image scene classification (RSI-SC) is crucial for various high-level applications, including RSI retrieval, image captioning, and object detection. Deep learning-based methods can accurately predict scene categories. However, these approaches often require numerous labeled samples for training, limiting their practicality in real-world RS applications where labels are scarce. Few-shot remote sensing image scene classification (FS-RSI-SC), by contrast, has garnered substantial research interest owing to its potential to mitigate the need for extensive training samples, and studies on FS-RSI-SC have surged in recent years. This paper presents a comprehensive overview of FS-RSI-SC research, categorizing existing methods into two groups. The first group comprises approaches based on data augmentation, transfer learning, metric learning, and meta-learning. Our analysis reveals that most existing FS-RSI-SC methods fall into the meta-learning category, employing attention mechanisms, self-supervised learning (SSL), and feature fusion techniques for enhanced performance. Additionally, transfer learning-based methods consistently outperform other approaches in this group. The second group is centered around large-scale pre-training, which has demonstrated remarkable competitiveness across various tasks, including FS-RSI-SC. This group of methods has shown considerable potential and is expected to attract more attention with the increasing popularity of large-scale pre-training and of unimodal and multimodal foundation models. Moreover, we propose a pipeline that harnesses the capabilities of powerful large vision-language models (VLMs) as image encoders, establishing new baselines for FS-RSI-SC on commonly used datasets under standard experimental settings. Our empirical results validate the effectiveness of utilizing large VLMs and highlight their potential for FS-RSI-SC. Through a joint analysis of state-of-the-art methods and our experiments with VLMs, we identify the prevailing challenges in FS-RSI-SC and outline promising directions for future research.

LVM-StARS: Large Vision Model Soft Adaption for Remote Sensing Scene Classification

A lightweight and stochastic depth residual attention network for remote sensing scene classification

Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

A-VL: Adaptive Attention for Large Vision-Language Models

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images

RS-Agent: Automating Remote Sensing Tasks through Intelligent Agents

Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

Few-shot remote sensing image scene classification: Recent advances, new baselines, and future trends

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Vision-Language Models in Remote Sensing: Current progress and future trends

Deep Semantic-Visual Alignment for Zero-Shot Remote Sensing Image Scene Classification

RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts

ForestDet: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation