Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval

Yuan Yuan, Yang Zhan, Zhitong Xiong
2023-08-24
Abstract: Vision-and-language pre-training (VLP) models have experienced a surge in popularity recently. By fine-tuning them on specific datasets, significant performance improvements have been observed in various tasks. However, full fine-tuning of VLP models not only consumes a significant amount of computational resources but also has a significant environmental impact. Moreover, as remote sensing (RS) data is constantly being updated, full fine-tuning may not be practical for real-world applications. To address this issue, in this work, we investigate the parameter-efficient transfer learning (PETL) method to effectively and efficiently transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task. To this end, we make the following contributions. 1) We construct a novel and sophisticated PETL framework for the RS image-text retrieval (RSITR) task, which includes the pretrained CLIP model, a multimodal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective; 2) To deal with the problem of high intra-modal similarity in RS data, we design a simple yet effective HMMC loss; 3) We provide comprehensive empirical studies for PETL-based RS image-text retrieval. Our results demonstrate that the proposed method is promising and of great potential for practical applications. 4) We benchmark extensive state-of-the-art PETL methods on the RSITR task. Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning, resulting in substantial savings in training costs. Our retrieval performance exceeds traditional methods by 7-13% and achieves comparable or better performance than full fine-tuning. This work can provide new ideas and useful insights for RS vision-language tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to transfer the knowledge of vision-language pre-training (VLP) models from the natural-scene domain to remote sensing (RS) image-text retrieval while training only a small number of parameters. Specifically, it uses parameter-efficient transfer learning (PETL) to avoid the heavy computational cost and environmental impact of full fine-tuning, while maintaining or improving retrieval performance. The paper points out that although large-scale VLP models such as CLIP perform very well at multimodal representation, fully fine-tuning them to adapt to RS data is computationally costly and has a large environmental impact. In addition, because RS data is constantly updated, repeated full fine-tuning is impractical in real applications. The paper therefore proposes a new PETL framework that transfers vision-language knowledge from the natural-scene domain to the RS domain, with a focus on image-text retrieval. It makes the following contributions:
1. It constructs a novel and sophisticated PETL framework for RS image-text retrieval, consisting of a pre-trained CLIP model, a multimodal remote sensing adapter, and a hybrid multi-modal contrastive (HMMC) learning objective.
2. It designs a simple yet effective HMMC loss to deal with the high intra-modal similarity of RS data.
3. It provides comprehensive empirical studies that demonstrate the potential of the proposed method for practical applications.
4. It benchmarks existing state-of-the-art PETL methods on the RS image-text retrieval task.
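To make the idea of a contrastive objective with both inter-modal and intra-modal terms more concrete, here is a minimal sketch in plain Python. It is not the paper's HMMC loss, whose exact formula is not given here. It assumes one plausible form: each image is pulled toward its paired text, while the other texts and the other images in the batch both act as negatives (and symmetrically for texts). The function names and the temperature value of 0.07 are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_matrix(rows, cols):
    """Pairwise cosine similarity between two lists of feature vectors."""
    rows = [l2_normalize(r) for r in rows]
    cols = [l2_normalize(c) for c in cols]
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def hybrid_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Assumed hybrid loss: cross-modal positives, cross- and intra-modal negatives.

    image_feats[i] and text_feats[i] are treated as a matched pair.
    """
    n = len(image_feats)
    i2t = cosine_matrix(image_feats, text_feats)
    t2i = cosine_matrix(text_feats, image_feats)
    i2i = cosine_matrix(image_feats, image_feats)
    t2t = cosine_matrix(text_feats, text_feats)

    total = 0.0
    for i in range(n):
        for cross, intra in ((i2t, i2i), (t2i, t2t)):
            # Candidates: every sample of the other modality ...
            logits = [s / temperature for s in cross[i]]
            # ... plus every *other* sample of the same modality as negatives.
            logits += [intra[i][j] / temperature for j in range(n) if j != i]
            # Cross-entropy with the matched cross-modal pair as the target.
            total += log_sum_exp(logits) - cross[i][i] / temperature
    return total / (2 * n)
```

With this form, a batch whose image and text features are correctly aligned yields a loss close to zero, while swapping the captions increases it. Near-duplicate images in the same batch also raise the loss, which is the kind of pressure an intra-modal term is meant to apply.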
The proposed method has only 0.16M trainable parameters, a 98.9% reduction compared to full fine-tuning, which substantially lowers training cost. Its retrieval performance is 7-13% higher than that of traditional methods and is comparable to or better than that of full fine-tuning. Through these contributions, the paper provides new ideas and useful insights for remote sensing vision-language tasks.
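The parameter savings come from freezing the pre-trained backbone and training only a small added module. The sketch below shows a generic residual bottleneck adapter in plain Python and how a reduction percentage can be computed. It is an illustration under stated assumptions, not the paper's multimodal remote sensing adapter: the class name, the layer sizes, and the zero-initialised up-projection are all hypothetical choices.

```python
import random


class BottleneckAdapter:
    """Generic residual adapter: x + up(relu(down(x))).

    In a PETL setup only these weights would be trained; the backbone
    that produces x stays frozen.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck, small random weights.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        self.down_bias = [0.0] * bottleneck
        # Up-projection: bottleneck -> dim, zero-initialised (an assumed
        # choice) so the adapter starts out as the identity mapping.
        self.up = [[0.0] * bottleneck for _ in range(dim)]
        self.up_bias = [0.0] * dim

    def num_parameters(self):
        dim, r = len(self.up), len(self.down)
        # Two weight matrices (dim * r each) plus the two bias vectors.
        return 2 * dim * r + r + dim

    def forward(self, x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.down, self.down_bias)]
        delta = [sum(w * h for w, h in zip(row, hidden)) + b
                 for row, b in zip(self.up, self.up_bias)]
        return [v + d for v, d in zip(x, delta)]


def reduction_percent(trainable, full):
    """Percentage of parameters that no longer need to be trained."""
    return 100.0 * (1.0 - trainable / full)


# Hypothetical sizes, chosen only to show the calculation.
adapter = BottleneckAdapter(dim=512, bottleneck=16)
trainable = adapter.num_parameters()
frozen_backbone = 100_000_000  # assumed backbone size, not from the paper
print(trainable, round(reduction_percent(trainable, frozen_backbone + trainable), 3))
```

The bottleneck width is the main knob: the parameter count grows linearly with it, so a narrow bottleneck keeps the trainable share of the model very small.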