Abstract: Vision-language models have demonstrated impressive capabilities in associating images and text by pretraining on extensive image-text paired data. Continual pretraining followed by fine-tuning has become the prevailing paradigm for boosting performance on domain-specific tasks under constrained computational resources. Thanks to the strong generalization abilities of foundation models, the demands for computational resources and large data corpora have been significantly reduced. Nonetheless, it is crucial to tailor the model to the characteristics of downstream tasks in order to mitigate the misalignment between the pretraining pretext tasks and the actual applications of interest. In this study, we use a CLIP-based model that has been continually pretrained on a 5-million image-text dataset in the remote sensing field as the foundation model, and we focus on cross-modal image-text retrieval. We introduce an efficient framework called remote sensing image-text retrieval fine-grained fine-tuning (RSITR-FFT), which refines the feature space by adding fine-grained word-region alignment and consistency regularization terms to the learning objective. The fine-grained alignment targets precise word-region correspondence beyond classical global-level image-text matching, while the consistency regularization encourages geometric coherence between the image and text modalities. Remarkably, our method achieves observable performance improvements while requiring far fewer fine-tuning samples (about 10,000, in contrast to the 400 million and 5 million samples used during CLIP's initial pretraining and GeoRSCLIP's continual pretraining, respectively). We perform quantitative evaluation on the RSICD, NWPU-Captions, and UCM-Captions datasets to demonstrate the effectiveness of RSITR-FFT. We further showcase its realistic application to high-resolution remote sensing imagery through qualitative visualization experiments on the FAIR1M-1.0 dataset. The code and models are available at https://github.com/d1x1u/RSITR-FFT.
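
To make the three ingredients named in the abstract concrete (global image-text matching, fine-grained word-region alignment, and a consistency regularizer), the following is a minimal NumPy sketch. It is not the authors' implementation: the max-over-regions token matching, the intra-modal similarity-matching form of the consistency term, the function names, the loss weights `lam_fine` and `lam_cons`, and the temperature are all assumptions chosen for illustration. The released repository should be consulted for the actual objective.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(sim, temperature=0.07):
    """CLIP-style InfoNCE on a (B, B) similarity matrix; matched pairs lie on the diagonal."""
    logits = sim / temperature
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()  # image -> text direction
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()  # text -> image direction
    return 0.5 * (i2t + t2i)

def global_similarity(img_global, txt_global):
    """Cosine similarity between every image and every caption: (B, D) x (B, D) -> (B, B)."""
    return l2_normalize(img_global) @ l2_normalize(txt_global).T

def fine_grained_similarity(regions, words, word_mask):
    """Assumed word-region score: each real word picks its best-matching region,
    and the per-word maxima are averaged.
    regions: (B, R, D), words: (B, W, D), word_mask: (B, W) with 1 for real tokens.
    Returns a (B, B) matrix indexed as [image, text]."""
    reg = l2_normalize(regions)
    wrd = l2_normalize(words)
    token_sim = np.einsum("jwd,ird->ijwr", wrd, reg)  # (B_img, B_txt, W, R)
    best = token_sim.max(axis=-1)                     # best region per word
    mask = word_mask[None, :, :]                      # broadcast over images
    return (best * mask).sum(-1) / np.maximum(mask.sum(-1), 1)

def consistency_regularizer(img_global, txt_global):
    """Assumed consistency term: the image-image and text-text similarity
    structures within a batch should agree."""
    img = l2_normalize(img_global)
    txt = l2_normalize(txt_global)
    return np.mean((img @ img.T - txt @ txt.T) ** 2)

def total_loss(img_global, txt_global, regions, words, word_mask,
               lam_fine=1.0, lam_cons=0.1):
    """Global contrastive loss plus weighted fine-grained and consistency terms."""
    loss_global = symmetric_contrastive_loss(global_similarity(img_global, txt_global))
    loss_fine = symmetric_contrastive_loss(
        fine_grained_similarity(regions, words, word_mask))
    loss_cons = consistency_regularizer(img_global, txt_global)
    return loss_global + lam_fine * loss_fine + lam_cons * loss_cons

# Toy batch: random features stand in for encoder outputs.
rng = np.random.default_rng(0)
B, R, W, D = 4, 9, 6, 16
img_global = rng.normal(size=(B, D))
txt_global = rng.normal(size=(B, D))
regions = rng.normal(size=(B, R, D))
words = rng.normal(size=(B, W, D))
word_mask = np.ones((B, W))
word_mask[:, 4:] = 0  # treat the last two token slots as padding

loss = total_loss(img_global, txt_global, regions, words, word_mask)
print(float(loss))
```

In a real fine-tuning loop the random arrays would be replaced by patch-level and token-level outputs of the image and text encoders, and the loss would be written in an autodiff framework so that gradients can flow back to the model.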

RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

RSGPT: A Remote Sensing Vision Language Model and Benchmark

RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval

Large Language Models for Captioning and Retrieving Remote Sensing Images

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

HumanVLM: Foundation for Human-Scene Vision-Language Model

Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations

MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation

DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark

EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain

RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

Towards a multimodal framework for remote sensing image change retrieval and captioning

Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques