RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval

Di Xiu, Luyan Ji, Xiurui Geng, Yirong Wu
DOI: https://doi.org/10.1109/lgrs.2024.3478176
IF: 5.343
2024-10-25
IEEE Geoscience and Remote Sensing Letters
Abstract: Vision-language models have demonstrated impressive capabilities in associating images and text by pretraining on extensive image-text paired data. The paradigm of continual pretraining followed by fine-tuning has become the prevailing approach for boosting performance on domain-specific tasks under constrained computational resources. Thanks to the superior generalization abilities of foundation models, the demands for computational resources and extensive data corpora have been significantly reduced. Nonetheless, it is crucial to tailor the model to the characteristics of downstream tasks to mitigate the misalignment between the pretraining pretext tasks and the actual applications of interest. In this study, we use a CLIP-based model that has been continually pretrained on a remote sensing dataset of 5 million image-text pairs as the foundation model, focusing on cross-modal image-text retrieval tasks. We introduce an efficient framework called remote sensing image-text retrieval fine-grained fine-tuning (RSITR-FFT), which refines the feature space by introducing fine-grained word-region alignment and incorporating consistency constraint regularization terms in the learning objectives. The fine-grained alignment aims for precise word-region correspondence beyond classical global-level image-text matching, while the consistency regularization encourages geometric coherence between the image and text modalities. Remarkably, our method achieves observable performance improvements while requiring far fewer fine-tuning samples: about 10 000, in contrast to the 400 million and 5 million samples used during CLIP's initial pretraining and GeoRSCLIP's continual pretraining stages, respectively. We perform quantitative evaluation on the RSICD, NWPU-Captions, and UCM-Captions datasets to demonstrate the effectiveness of RSITR-FFT. We further showcase its realistic application to high-resolution remote sensing imagery through qualitative visualization experiments on the FAIR1M-1.0 dataset. The code and models are available at https://github.com/d1x1u/RSITR-FFT.
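The abstract names three ingredients: global image-text matching, fine-grained word-region alignment, and a consistency regularizer. The sketch below illustrates how such terms could be combined in a single objective. It is not the authors' implementation, and the abstract does not give the exact formulas. The max-over-regions word matching, the intra-modal similarity-matching penalty, the function names, the temperature, and the weights `w_fine` and `w_cons` are all assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along `axis`.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(sim, temperature=0.07):
    # CLIP-style symmetric InfoNCE over a (B, B) image-by-text similarity
    # matrix whose diagonal holds the matched pairs.
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def global_similarity(img_global, txt_global):
    # Cosine similarity between global embeddings, shape (B, B).
    return l2_normalize(img_global) @ l2_normalize(txt_global).T

def fine_grained_similarity(regions, words):
    # ASSUMPTION: each word is matched to its most similar region and the
    # scores are averaged over words. regions: (B, R, D), words: (B, W, D).
    r = l2_normalize(regions)
    w = l2_normalize(words)
    token_sim = np.einsum("jwd,ird->ijwr", w, r)  # (images, texts, W, R)
    return token_sim.max(axis=3).mean(axis=2)     # (B, B)

def consistency_penalty(img_global, txt_global):
    # ASSUMPTION: "geometric coherence" is read here as agreement between
    # the image-image and text-text similarity structures.
    i = l2_normalize(img_global)
    t = l2_normalize(txt_global)
    return np.mean((i @ i.T - t @ t.T) ** 2)

def fine_tuning_loss(img_global, txt_global, regions, words,
                     w_fine=1.0, w_cons=0.1):
    # Hypothetical weighting of the three terms.
    loss_global = symmetric_contrastive_loss(
        global_similarity(img_global, txt_global))
    loss_fine = symmetric_contrastive_loss(
        fine_grained_similarity(regions, words))
    loss_cons = consistency_penalty(img_global, txt_global)
    return loss_global + w_fine * loss_fine + w_cons * loss_cons

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, R, W, D = 4, 6, 3, 8  # batch, regions, words, embedding size
    loss = fine_tuning_loss(rng.normal(size=(B, D)), rng.normal(size=(B, D)),
                            rng.normal(size=(B, R, D)), rng.normal(size=(B, W, D)))
    print(float(loss))
```

In a real fine-tuning setup the embeddings would come from the image and text encoders and the loss would be written in an autodiff framework; NumPy is used here only to keep the arithmetic explicit. The released repository linked in the abstract is the reference for the actual objectives.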
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics