ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

Samar Khanna, Medhanie Irgau, David B. Lobell, Stefano Ermon
2024-10-06
Abstract: Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this new domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-training and fine-tuning ViTs. Using the DinoV2 training objective, we demonstrate up to 7.5% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the-art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and unfreezing more ViT blocks. Code is available on the project website: https://samar-khanna.github.io/ExPLoRA/
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to efficiently adapt pre-trained Vision Transformers (ViTs) to new visual domains (such as satellite images or medical images) that are separated from the natural image domain by a significant domain shift. Specifically, the authors propose ExPLoRA, a parameter-efficient extended pre-training method that improves the transfer learning performance of Vision Transformers in new domains.

### Problem Background

1. **Domain Shift Problem**:
   - Current visual foundation models (VFMs) such as DinoV2 and MAE are mainly pre-trained on large-scale natural image datasets.
   - These models perform poorly when applied to other visual domains (such as satellite images or medical images) because of the significant domain shift between those domains and the natural image domain.
2. **Limitations of Traditional Methods**:
   - The traditional solution is to pre-train from scratch for each new domain, which requires a large amount of compute and time and does not reuse the knowledge already captured by models pre-trained on natural images.
   - Parameter-efficient fine-tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) can fine-tune a model for downstream tasks with a small number of parameters, but their effectiveness is limited when the domain shift is large.

### The Solution Proposed in the Paper

The authors propose ExPLoRA (Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts), which addresses the above problems in two stages:

1. **Extended Pre-training Phase**:
   - Initialize the model with ViT weights pre-trained on a natural image dataset.
   - Continue unsupervised pre-training on the new domain, unfreezing only 1-2 pre-trained ViT blocks and using LoRA to tune the weights of all other layers.
2. **Parameter-Efficient Fine-Tuning**:
   - After extended pre-training, use only LoRA for supervised fine-tuning on the new domain.

### Experimental Results

Through experiments, the authors demonstrate the performance of ExPLoRA in multiple domains:

- **Satellite Images**: On the fMoW-RGB dataset, ExPLoRA achieves a top-1 accuracy of 79.15%, which is 8.2% higher than the best existing method, while using only 6% of the ViT encoder parameters.
- **Multispectral Satellite Images**: On the fMoW-Sentinel dataset, ExPLoRA outperforms full pre-training from scratch while using less than 10% of the parameters.
- **Other Domains**: ExPLoRA also performs well on wildlife, medical, and agricultural imagery, which supports its wide applicability.

### Summary

The core problem of this paper is how to efficiently adapt pre-trained ViTs to new visual domains in the presence of domain shift. ExPLoRA addresses this problem by extending the pre-training phase and combining it with parameter-efficient fine-tuning, and it achieves strong experimental results in multiple domains.
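
To give a feel for why the extended pre-training recipe described above (fully unfreezing 1-2 ViT blocks and applying LoRA everywhere else) stays within a small parameter budget, here is a minimal back-of-the-envelope sketch. LoRA is commonly described as adding a low-rank update `B @ A` to a frozen weight matrix, so each adapted `d x d` matrix contributes `2 * rank * d` trainable parameters. Everything specific in the code is an assumption for illustration and is not taken from the paper: the `ViTConfig` shape (a common ViT-L layout), the LoRA rank, the choice to adapt two attention matrices per block, and all function names. Patch embedding, position embeddings, and any heads are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViTConfig:
    """Hypothetical ViT shape; defaults follow a common ViT-L layout (assumption)."""
    embed_dim: int = 1024
    depth: int = 24
    mlp_ratio: int = 4


def block_params(cfg: ViTConfig) -> int:
    """Weights and biases in one transformer block (attention, MLP, two LayerNorms)."""
    d = cfg.embed_dim
    hidden = cfg.mlp_ratio * d
    attention = (d * 3 * d + 3 * d) + (d * d + d)    # fused QKV plus output projection
    mlp = (d * hidden + hidden) + (hidden * d + d)   # two linear layers
    layer_norms = 2 * (2 * d)                        # scale and shift, twice
    return attention + mlp + layer_norms


def lora_params_per_block(cfg: ViTConfig, rank: int, adapted_matrices: int = 2) -> int:
    """LoRA adds A (rank x d) and B (d x rank) for each adapted d x d matrix.

    Adapting two matrices per block (for example query and value) is an assumption.
    """
    return adapted_matrices * 2 * rank * cfg.embed_dim


def trainable_fraction(cfg: ViTConfig, unfrozen_blocks: int, rank: int) -> float:
    """Share of encoder-block parameters that are trained.

    Unfrozen blocks are trained in full; every remaining block only trains its
    LoRA matrices while the original weights stay frozen.
    """
    if not 0 <= unfrozen_blocks <= cfg.depth:
        raise ValueError("unfrozen_blocks must lie in [0, depth]")
    total = cfg.depth * block_params(cfg)
    trainable = unfrozen_blocks * block_params(cfg)
    trainable += (cfg.depth - unfrozen_blocks) * lora_params_per_block(cfg, rank)
    return trainable / total


if __name__ == "__main__":
    cfg = ViTConfig()
    for blocks in (1, 2):
        share = trainable_fraction(cfg, unfrozen_blocks=blocks, rank=8)
        print(f"{blocks} unfrozen block(s), rank 8: {share:.2%} of block parameters")
```

Under these assumed settings, one unfrozen block plus rank-8 LoRA on the other 23 blocks trains roughly 4.4% of the encoder-block parameters, which is in line with the "less than 10%" budget quoted above. The exact figure depends on the rank, on which matrices are adapted, and on the model size, so treat it as an order-of-magnitude check only.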