Abstract: Vision-language models have demonstrated impressive capabilities in associating images and text by pretraining on extensive image-text paired data. Continual pretraining followed by fine-tuning has become the prevailing paradigm for boosting performance on domain-specific tasks under constrained computational resources. Thanks to the strong generalization abilities of foundation models, the demands for computational resources and large data corpora have been significantly reduced. Nonetheless, it is crucial to tailor the model to the characteristics of downstream tasks to mitigate the misalignment between the pretraining pretext tasks and the actual applications of interest. In this study, we use a CLIP-based model that has been continually pretrained on a 5-million image-text dataset in the remote sensing field as the foundation model, focusing on cross-modal image-text retrieval tasks. We introduce an efficient framework called remote sensing image-text retrieval fine-grained fine-tuning (RSITR-FFT), which refines the feature space by introducing fine-grained word-region alignment and incorporating consistency constraint regularization terms into the learning objectives. The fine-grained alignment aims for precise word-region correspondence beyond classical global-level image-text matching, while the consistency regularization encourages geometric coherence between the image and text modalities. Remarkably, our method achieves observable performance improvements while requiring far fewer fine-tuning samples: about 10,000, in contrast to the 400 million and 5 million samples used during CLIP's initial pretraining and GeoRSCLIP's continual pretraining stages, respectively. We perform quantitative evaluation on the RSICD, NWPU-Captions, and UCM-Captions datasets to demonstrate the effectiveness of RSITR-FFT. We further showcase its realistic application to high-resolution remote sensing imagery through qualitative visualization experiments on the FAIR1M-1.0 dataset. The code and models are available at https://github.com/d1x1u/RSITR-FFT.
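
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that word-region alignment scores a caption by letting each word embedding pick its best-matching region embedding, and that the consistency term penalizes the gap between image-image and text-text similarity structure within a batch. The function names `word_region_score` and `consistency_penalty`, and both formulations, are illustrative assumptions; the abstract does not specify the actual losses.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def word_region_score(words, regions):
    """Assumed fine-grained score: each word takes its best-matching
    region, and the per-word maxima are averaged over the caption."""
    return sum(max(cosine(w, r) for r in regions) for w in words) / len(words)

def consistency_penalty(image_embs, text_embs):
    """Assumed consistency term: mean squared gap between the
    image-image and text-text cosine similarity matrices of a batch."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            gap = (cosine(image_embs[i], image_embs[j])
                   - cosine(text_embs[i], text_embs[j]))
            total += gap * gap
    return total / (n * n)
```

Under these assumptions, a fine-tuning objective could add a weighted `consistency_penalty` to a contrastive loss computed from `word_region_score` over all image-caption pairs in a batch; the weighting and the form of the contrastive loss are likewise left open here.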

Bootstrapping Interactive Image–Text Alignment for Remote Sensing Image Captioning

A Joint-Training Two-Stage Method for Remote Sensing Image Captioning

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning

Multi-View Feature Fusion and Visual Prompt for Remote Sensing Image Captioning

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning

Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Improving OCR-based Image Captioning by Incorporating Geometrical Relationship

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Learning Video-Text Aligned Representations for Video Captioning

IC3: Image Captioning by Committee Consensus

Improving Image Captioning through Visual and Semantic Mutual Promotion

Multi-label Semantic Feature Fusion for Remote Sensing Image Captioning

RSGPT: A Remote Sensing Vision Language Model and Benchmark

LuoJiaHOG: A Hierarchy Oriented Geo-aware Image Caption Dataset for Remote Sensing Image-Text Retrieval

RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance