Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval

Yuan Yuan, Yang Zhan, Zhitong Xiong

2023-08-24

Abstract: Vision-and-language pre-training (VLP) models have experienced a surge in popularity recently. By fine-tuning them on specific datasets, significant performance improvements have been observed in various tasks. However, full fine-tuning of VLP models not only consumes a significant amount of computational resources but also has a significant environmental impact. Moreover, as remote sensing (RS) data is constantly being updated, full fine-tuning may not be practical for real-world applications. To address this issue, in this work, we investigate the parameter-efficient transfer learning (PETL) method to effectively and efficiently transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task. To this end, we make the following contributions. 1) We construct a novel and sophisticated PETL framework for the RS image-text retrieval (RSITR) task, which includes the pretrained CLIP model, a multimodal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective; 2) To deal with the problem of high intra-modal similarity in RS data, we design a simple yet effective HMMC loss; 3) We provide comprehensive empirical studies for PETL-based RS image-text retrieval. Our results demonstrate that the proposed method is promising and of great potential for practical applications. 4) We benchmark extensive state-of-the-art PETL methods on the RSITR task. Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning, resulting in substantial savings in training costs. Our retrieval performance exceeds traditional methods by 7-13% and achieves comparable or better performance than full fine-tuning. This work can provide new ideas and useful insights for RS vision-language tasks.
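
The abstract names an adapter attached to a pretrained CLIP model but gives no architectural details here. As a rough illustration of why adapter-style tuning needs so few trainable parameters, the sketch below implements a generic residual bottleneck adapter in NumPy. The layer shape, the sizes (`dim=512`, `bottleneck=16`), and all function names are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

def init_adapter(dim, bottleneck, rng):
    """Create weights for a hypothetical bottleneck adapter (down- and up-projection)."""
    return {
        "w_down": rng.normal(0.0, 0.02, size=(dim, bottleneck)),
        "b_down": np.zeros(bottleneck),
        # Zero-initialised up-projection: the adapter starts as an identity mapping.
        "w_up": np.zeros((bottleneck, dim)),
        "b_up": np.zeros(dim),
    }

def adapter_forward(x, params):
    """Residual bottleneck: x + relu(x @ W_down + b_down) @ W_up + b_up."""
    hidden = np.maximum(x @ params["w_down"] + params["b_down"], 0.0)
    return x + hidden @ params["w_up"] + params["b_up"]

def count_params(params):
    """Number of scalars that would be trained; the backbone stays frozen."""
    return sum(v.size for v in params.values())

rng = np.random.default_rng(0)
dim, bottleneck = 512, 16  # illustrative sizes, not taken from the paper
adapter = init_adapter(dim, bottleneck, rng)

# Stand-in for features produced by a frozen backbone (assumption).
frozen_features = rng.normal(size=(4, dim))
adapted = adapter_forward(frozen_features, adapter)
trainable = count_params(adapter)  # 2 * 512 * 16 + 16 + 512 = 16912
```

With these assumed sizes a single adapter adds about 17K trainable values, so a handful of such modules stays well under a million parameters while every backbone weight remains untouched.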

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper asks how to transfer the knowledge of vision-language pre-training (VLP) models from the natural-image domain to remote sensing (RS) image-text retrieval while training only a small number of parameters. Specifically, it uses parameter-efficient transfer learning (PETL) to avoid the computational cost and environmental impact of full fine-tuning while maintaining or improving retrieval performance.

The paper points out that although large-scale VLP models such as CLIP learn strong multi-modal representations, fully fine-tuning them on RS data is computationally expensive and has a large environmental impact. Because RS data is constantly updated, repeated full fine-tuning is also impractical in real-world applications. The paper therefore proposes a new PETL framework for transferring vision-language knowledge from the natural domain to the RS domain, targeting the image-text retrieval task.

The paper makes the following contributions:

1. It constructs a novel and sophisticated PETL framework for RS image-text retrieval (RSITR), comprising a pre-trained CLIP model, a multi-modal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective.
2. It designs a simple yet effective HMMC loss to deal with the high intra-modal similarity of RS data.
3. It provides comprehensive empirical studies that demonstrate the method's potential for practical applications.
4. It benchmarks existing state-of-the-art PETL methods on the RSITR task.

The proposed model contains only 0.16M trainable parameters, a 98.9% reduction compared to full fine-tuning, which substantially reduces training cost. Its retrieval performance exceeds traditional methods by 7-13% and is comparable to or better than full fine-tuning. Through these contributions, the paper offers new ideas and useful insights for RS vision-language tasks.
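
The summary describes the HMMC objective only at a high level, as a contrastive loss that also addresses high intra-modal similarity. One plausible reading, sketched below in NumPy, adds an intra-modal term to the usual symmetric image-text InfoNCE loss. The second "view" of each embedding, the temperature of 0.07, the equal weighting, and all function names are assumptions; this is not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(queries, keys, temperature=0.07):
    """Cross-entropy over cosine similarities; matching pairs share a row index."""
    logits = l2_normalize(queries) @ l2_normalize(keys).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def hybrid_contrastive_loss(img, txt, img_view, txt_view, intra_weight=1.0):
    """Inter-modal (image <-> text) plus intra-modal (sample <-> second view) terms."""
    inter = 0.5 * (info_nce(img, txt) + info_nce(txt, img))
    # Assumed intra-modal term: separates each sample from other samples
    # of the same modality, using a perturbed second view as the positive.
    intra = 0.5 * (info_nce(img, img_view) + info_nce(txt, txt_view))
    return inter + intra_weight * intra

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 32))               # stand-in image embeddings
txt = img + 0.1 * rng.normal(size=(8, 32))   # stand-in paired text embeddings
img_view = img + 0.05 * rng.normal(size=(8, 32))
txt_view = txt + 0.05 * rng.normal(size=(8, 32))
loss = hybrid_contrastive_loss(img, txt, img_view, txt_view)
```

Setting `intra_weight=0.0` recovers the plain symmetric image-text loss, which makes it easy to check how much an intra-modal term changes training in an ablation.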

VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

Lessons Learned from a Unifying Empirical Study of Parameter-Efficient Transfer Learning (PETL) in Visual Recognition

RSITR-FFT: Efficient Fine-Grained Fine-Tuning Framework With Consistency Regularization for Remote Sensing Image-Text Retrieval

VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

Prior-Experience-Based Vision-Language Model for Remote Sensing Image-Text Retrieval

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

Dynamic Visual Prompt Tuning for Parameter Efficient Transfer Learning

FPT+: A Parameter and Memory Efficient Transfer Learning Method for High-resolution Medical Image Classification

PERS: Parameter-Efficient Multimodal Transfer Learning for Remote Sensing Visual Question Answering

Fine-grained Prompt Tuning: A Parameter and Memory Efficient Transfer Learning Method for High-resolution Medical Image Classification

One Network, Many Masks: Towards More Parameter-Efficient Transfer Learning

Revisit Parameter-Efficient Transfer Learning: A Two-Stage Paradigm

When Parameter-efficient Tuning Meets General-purpose Vision-language Models

ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning

Knowledge-Aided Momentum Contrastive Learning for Remote-Sensing Image Text Retrieval

SurgPETL: Parameter-Efficient Image-to-Surgical-Video Transfer Learning for Surgical Phase Recognition

An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval