Abstract: Transfer learning based on full fine-tuning (FFT) of the pre-trained encoder and task-specific decoder becomes increasingly complex as deep models grow exponentially. Parameter efficient fine-tuning (PEFT) approaches using adapters consisting of small learnable layers have emerged as an alternative to FFT, achieving comparable performance while maintaining high training efficiency. However, the inflexibility of the adapter with respect to input instances limits its capability of learning task-specific information in diverse downstream tasks. In this paper, we propose a novel PEFT approach, input-Conditioned transFormer, termed iConFormer, that leverages a dynamic adapter conditioned on the input instances. To secure flexible learning ability on input instances in various downstream tasks, we introduce an input-Conditioned Network (iCoN) in the dynamic adapter that enables instance-level feature transformation. To be specific, iCoN generates channel-wise convolutional kernels for each feature and transforms it using an adaptive convolution process to effectively capture task-specific and fine-grained details tailored to downstream tasks. Experimental results demonstrate that by tuning just 1.6% to 2.8% of the Transformer backbone parameters, iConFormer achieves performance comparable to FFT in monocular depth estimation and semantic segmentation, while outperforming it in image classification and instance segmentation. Also, the proposed method consistently outperforms recent PEFT methods for all the tasks mentioned above.
### What problems does this paper attempt to solve?
This paper addresses the cost and complexity of transfer learning with deep models, especially the challenges of fine-tuning large-scale pre-trained models. Specifically, it focuses on the limited flexibility and performance of Parameter-Efficient Fine-Tuning (PEFT) methods across diverse downstream tasks.
#### Background and problem description
As deep neural networks (DNNs) grow larger, Full Fine-Tuning (FFT) still achieves good performance, but its computational cost and resource consumption increase significantly. PEFT methods introduce small learnable modules (such as adapters) and can match the performance of FFT while keeping training efficient. However, existing PEFT methods have the following limitations:
1. **Insufficient flexibility for input instances**: Traditional adapters apply the same transformation to every input instance and cannot adjust to different input features, which limits their ability to capture task-specific information in diverse downstream tasks.
2. **Limited ability to capture local details**: Backbone networks such as the Vision Transformer (ViT) tend to focus on global information and overlook fine-grained local features in the image, which hurts tasks that require fine-grained prediction (such as semantic segmentation and monocular depth estimation).
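The first limitation can be illustrated with a minimal sketch of a conventional adapter. The summary does not specify the adapter design, so this assumes the common bottleneck form (down-projection, nonlinearity, up-projection, residual connection). The class name `StaticAdapter`, the dimensions, and the initialisation are illustrative assumptions, not taken from the paper.

```python
import numpy as np


class StaticAdapter:
    """Assumed bottleneck adapter: down-project, ReLU, up-project, residual."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # These weights are learned once and then shared by every input.
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        # Zero up-projection, so the adapter starts out as an identity map.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, tokens):
        # tokens: array of shape (num_tokens, dim)
        hidden = np.maximum(tokens @ self.w_down, 0.0)
        return tokens + hidden @ self.w_up

    def num_params(self):
        return self.w_down.size + self.w_up.size


adapter = StaticAdapter(dim=8, bottleneck=2)
tokens = np.ones((4, 8))
output = adapter(tokens)
```

Nothing in `__call__` depends on the input beyond the matrix products, so two very different images pass through exactly the same learned transformation.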
#### Proposed solutions
To address these problems, the paper proposes a new PEFT method, **iConFormer** (input-Conditioned transFormer). Its main innovation is the **input-Conditioned Network (iCoN)**, which dynamically generates convolution kernels for each input instance, enabling more flexible and accurate feature extraction.
Specifically, iConFormer addresses the limitations of existing PEFT methods in the following ways:
- **Dynamically adapt to input instances**: iCoN generates convolution kernels from the input features, so the model can better capture the properties of each input and adapt to different tasks.
- **Capture fine-grained local features**: By generating convolution kernels at the channel level, iConFormer captures local details in the image while remaining parameter-efficient, which improves performance on dense prediction tasks.
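The two points above can be sketched as an input-conditioned, channel-wise convolution. This is a hedged illustration only: the summary does not describe how iCoN builds its kernels, so the generator here (global average pooling followed by a single linear map `w_gen`), the 3x3 kernel size, and the function names `generate_kernels` and `depthwise_conv` are all assumptions rather than the paper's architecture.

```python
import numpy as np


def generate_kernels(features, w_gen, k=3):
    """Produce one k x k kernel per channel from the input itself (assumed design)."""
    channels = features.shape[0]
    pooled = features.mean(axis=(1, 2))      # (C,) summary of this input
    kernels = pooled @ w_gen                 # (C * k * k,)
    return kernels.reshape(channels, k, k)


def depthwise_conv(features, kernels):
    """Filter each channel with its own kernel (zero padding, stride 1).

    As in most deep learning libraries, this is a cross-correlation.
    """
    channels, height, width = features.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(features, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(features.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            window = padded[:, dy:dy + height, dx:dx + width]
            out += kernels[:, dy, dx][:, None, None] * window
    return out


rng = np.random.default_rng(0)
w_gen = rng.normal(size=(4, 4 * 3 * 3))      # hypothetical generator weights
image_a = rng.normal(size=(4, 6, 6))         # (channels, height, width)
image_b = rng.normal(size=(4, 6, 6))

kernels_a = generate_kernels(image_a, w_gen)
kernels_b = generate_kernels(image_b, w_gen)
out_a = depthwise_conv(image_a, kernels_a)
```

Only `w_gen` would be learned in this sketch. The kernels themselves are recomputed for every input, so `image_a` and `image_b` are filtered differently, and the small spatial window is what gives the adapter access to local detail.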
#### Experimental results
The experimental results show that, by fine-tuning only 1.6% to 2.8% of the parameters in the Transformer backbone, iConFormer achieves performance comparable to or better than FFT on multiple tasks, including:
- **Image classification**: On the CIFAR-100, SVHN, and Food-101 datasets, the Top-1 accuracy of iConFormer is 4.5%, 3.36%, and 4.99% higher than that of the VPT method, respectively, and also exceeds that of full fine-tuning.
- **Monocular depth estimation**: On the NYU-v2 dataset, the RMSE of iConFormer is close to that of full fine-tuning and better than that of other PEFT methods.
- **Semantic segmentation**: On the ADE20K dataset, the mIoU of iConFormer reaches 50.82%, close to the performance of full fine-tuning, while using fewer parameters.
- **Instance segmentation**: On the COCO dataset, iConFormer exceeds other PEFT methods in both APBox and APMask metrics, and even surpasses full fine-tuning.
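For reference, the "1.6% to 2.8% of the backbone parameters" figure quoted above is a simple ratio of tuned parameters to backbone parameters. The sketch below shows the arithmetic; the helper name `tuned_percentage` and both parameter counts are hypothetical and are not figures from the paper.

```python
def tuned_percentage(tuned_params, backbone_params):
    """Tuned parameters as a percentage of the backbone's parameter count."""
    return 100.0 * tuned_params / backbone_params


# Hypothetical counts, chosen only to illustrate the calculation.
backbone_params = 100_000_000
adapter_params = 2_000_000
percentage = tuned_percentage(adapter_params, backbone_params)
```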
In summary, by introducing a dynamic adaptation mechanism, iConFormer improves the performance and flexibility of PEFT across diverse downstream tasks and addresses the weaknesses of existing methods in adapting to input instances and capturing local features.