Abstract: Vision-language models have demonstrated impressive capabilities in associating images and text by pretraining on extensive image-text paired data. Continual pretraining followed by fine-tuning has become the prevailing paradigm for boosting performance on domain-specific tasks under constrained computational resources. Because foundation models generalize well, the demands for computational resources and extensive data corpora have been significantly reduced. Nonetheless, it is crucial to tailor the model to the characteristics of downstream tasks to mitigate the misalignment between the pretraining pretext tasks and the actual applications of interest. In this study, we use a CLIP-based model that has been continually pretrained on a 5-million-sample image-text dataset in the remote sensing field as the foundation model, focusing on cross-modal image-text retrieval tasks. We introduce an efficient framework called remote sensing image-text retrieval fine-grained fine-tuning (RSITR-FFT), which refines the feature space by introducing fine-grained word-region alignment and by adding consistency-constraint regularization terms to the learning objectives. The fine-grained alignment aims for precise word-region correspondence beyond classical global-level image-text matching, while the consistency regularization encourages geometric coherence between the image and text modalities. Remarkably, our method achieves observable performance improvements while requiring far fewer fine-tuning samples: about 10 000, in contrast to the 400 million and 5 million samples used during CLIP's initial pretraining and GeoRSCLIP's continual pretraining stages, respectively. We perform quantitative evaluation on the RSICD, NWPU-Captions, and UCM-Captions datasets to demonstrate the effectiveness of RSITR-FFT. We further showcase its realistic application to high-resolution remote sensing imagery through qualitative visualization experiments on the FAIR1M-1.0 dataset. The code and models are available at https://github.com/d1x1u/RSITR-FFT.
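
To make the three ingredients named in the abstract concrete, the following is a minimal NumPy sketch of how a global image-text contrastive loss, a fine-grained word-region term, and a consistency regularizer could be combined. It is not the authors' implementation (that lives in the linked repository). Every function name, the max-over-regions word matching, the squared-difference form of the consistency penalty, and the weights `fine_weight` and `cons_weight` are assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(sim, temperature=0.07):
    """CLIP-style loss on a (B, B) similarity matrix; matched pairs lie on the diagonal."""
    logits = sim / temperature
    image_to_text = -np.diag(log_softmax(logits, axis=1)).mean()
    text_to_image = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (image_to_text + text_to_image)


def global_similarity(img_global, txt_global):
    """Cosine similarity between pooled image and text embeddings, shape (B, B)."""
    return l2_normalize(img_global) @ l2_normalize(txt_global).T


def fine_grained_similarity(regions, words, word_mask):
    """Assumed word-region score: each word takes its best-matching region,
    and the scores are averaged over the non-padding words of the caption.

    regions: (B, R, D) region features, words: (B, W, D) word features,
    word_mask: (B, W) with 1 for real tokens and 0 for padding.
    """
    r = l2_normalize(regions)
    w = l2_normalize(words)
    # token_sim[i, j, w, r]: similarity of word w in caption j to region r in image i
    token_sim = np.einsum("ird,jwd->ijwr", r, w)
    best_region = token_sim.max(axis=-1)            # (B, B, W)
    mask = word_mask[None, :, :]                    # (1, B, W)
    return (best_region * mask).sum(-1) / np.maximum(mask.sum(-1), 1)


def consistency_penalty(img_global, txt_global):
    """Assumed regularizer: the pairwise similarity structure inside the image
    batch should match the pairwise similarity structure inside the text batch."""
    i = l2_normalize(img_global)
    t = l2_normalize(txt_global)
    return np.mean((i @ i.T - t @ t.T) ** 2)


def total_loss(img_global, txt_global, regions, words, word_mask,
               fine_weight=1.0, cons_weight=0.1):
    """Weighted sum of the global, fine-grained, and consistency terms (weights are hypothetical)."""
    loss_global = symmetric_contrastive_loss(global_similarity(img_global, txt_global))
    loss_fine = symmetric_contrastive_loss(
        fine_grained_similarity(regions, words, word_mask))
    loss_cons = consistency_penalty(img_global, txt_global)
    return loss_global + fine_weight * loss_fine + cons_weight * loss_cons
```

In this sketch the fine-grained term reuses the same contrastive loss as the global term, only with a similarity matrix built from word-region matches, so a fine-tuning loop would call `total_loss` once per batch. The actual formulation of each term in RSITR-FFT may differ.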

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning

DiffCLIP: Few-shot Language-driven Multimodal Classifier

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

Multilingual Vision-Language Pre-training for the Remote Sensing Domain

Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment

RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval

RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

Generative ConvNet Foundation Model With Sparse Modeling and Low-Frequency Reconstruction for Remote Sensing Image Interpretation

RS5M: A Large Scale Vision-Language Dataset for Remote Sensing Vision-Language Foundation Model

Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction