Hierarchical Side-Tuning for Vision Transformers

Weifeng Lin, Ziheng Wu, Wentao Yang, Mingxin Huang, Jun Huang, Lianwen Jin

2024-05-16

Abstract: Fine-tuning pre-trained Vision Transformers (ViTs) has showcased significant promise in enhancing visual recognition tasks. Yet, the demand for individualized and comprehensive fine-tuning processes for each task entails substantial computational and memory costs, posing a considerable challenge. Recent advancements in Parameter-Efficient Transfer Learning (PETL) have shown potential for achieving high performance with fewer parameter updates compared to full fine-tuning. However, their effectiveness is primarily observed in simple tasks like image classification, while they encounter challenges with more complex vision tasks like dense prediction. To address this gap, this study aims to identify an effective tuning method that caters to a wider range of visual tasks. In this paper, we introduce Hierarchical Side-Tuning (HST), an innovative PETL method facilitating the transfer of ViT models to diverse downstream tasks. Diverging from existing methods that focus solely on fine-tuning parameters within specific input spaces or modules, HST employs a lightweight Hierarchical Side Network (HSN). This network leverages intermediate activations from the ViT backbone to model multi-scale features, enhancing prediction capabilities. To evaluate HST, we conducted comprehensive experiments across a range of visual tasks, including classification, object detection, instance segmentation, and semantic segmentation. Remarkably, HST achieved state-of-the-art performance in 13 out of the 19 tasks on the VTAB-1K benchmark, with the highest average Top-1 accuracy of 76.1%, while fine-tuning a mere 0.78M parameters. When applied to object detection and semantic segmentation tasks on the COCO and ADE20K test-dev benchmarks, HST outperformed existing PETL methods and even surpassed full fine-tuning.
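To give a rough sense of scale for the 0.78M trainable parameters quoted in the abstract, the short calculation below relates that figure to the size of a backbone. The abstract does not say which ViT variant was used, so the 86M value (the commonly cited parameter count of ViT-B/16) is an assumption made only for illustration; the names in the snippet are likewise hypothetical.

```python
# Figures taken from the abstract.
TRAINABLE_PARAMS_M = 0.78          # parameters fine-tuned by HST, in millions
TASKS_WON, TASKS_TOTAL = 13, 19    # VTAB-1K tasks where HST is reported best

# ASSUMPTION: the abstract does not name the backbone. 86M is the commonly
# cited size of ViT-B/16 and is used here purely as an illustrative value.
ASSUMED_BACKBONE_PARAMS_M = 86.0


def trainable_fraction(trainable_m: float, backbone_m: float) -> float:
    """Return trainable parameters as a fraction of the backbone size."""
    if backbone_m <= 0:
        raise ValueError("backbone size must be positive")
    return trainable_m / backbone_m


fraction = trainable_fraction(TRAINABLE_PARAMS_M, ASSUMED_BACKBONE_PARAMS_M)
win_rate = TASKS_WON / TASKS_TOTAL

print(f"Trainable share of the assumed backbone: {fraction:.2%}")
print(f"Share of VTAB-1K tasks where HST is best: {win_rate:.1%}")
```

Under this assumption the tuned parameters amount to roughly one percent of the backbone, which is the sense in which the method is parameter-efficient. A different backbone would change the ratio but not the order of magnitude.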

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to fine-tune pre-trained Vision Transformer (ViT) models in a parameter-efficient way so that they transfer well to a broad range of vision tasks: not only image classification, but also dense prediction tasks such as object detection, instance segmentation, and semantic segmentation. Existing parameter-efficient transfer learning (PETL) methods work well mainly on classification and struggle on dense prediction. To close this gap, the paper proposes Hierarchical Side-Tuning (HST), which attaches a lightweight Hierarchical Side Network (HSN) that uses the intermediate activations of the ViT backbone to model multi-scale features, thereby improving prediction capability. Experiments show that HST outperforms existing PETL methods across these tasks and surpasses full fine-tuning on object detection and semantic segmentation (COCO and ADE20K).
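The phrase "multi-scale features" can be made concrete with a little shape arithmetic. A plain ViT keeps a single token grid at every block, whereas dense prediction heads are conventionally fed a pyramid of feature maps at several strides. The sketch below only computes spatial sizes; the 224-pixel input, the patch size of 16, and the strides 4/8/16/32 are illustrative assumptions (a common feature-pyramid convention), not details confirmed by the text above, and it does not implement the HSN itself.

```python
from dataclasses import dataclass

# ASSUMPTIONS for illustration only: none of these values are stated in the
# summary above.
IMAGE_SIZE = 224                   # square input resolution in pixels
PATCH_SIZE = 16                    # ViT patch size
ASSUMED_STRIDES = (4, 8, 16, 32)   # common feature-pyramid strides


@dataclass(frozen=True)
class FeatureLevel:
    """Spatial size of one feature map in a pyramid."""
    stride: int
    height: int
    width: int


def vit_token_grid(image_size: int, patch_size: int) -> int:
    """Side length of the token grid of a plain ViT (identical at every block)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    return image_size // patch_size


def pyramid_levels(image_size: int, strides: tuple) -> list:
    """Spatial sizes a multi-scale side network would need to produce."""
    return [FeatureLevel(s, image_size // s, image_size // s) for s in strides]


grid = vit_token_grid(IMAGE_SIZE, PATCH_SIZE)
levels = pyramid_levels(IMAGE_SIZE, ASSUMED_STRIDES)

print(f"Plain ViT: one {grid}x{grid} token grid at every block")
for level in levels:
    print(f"stride {level.stride:>2}: {level.height}x{level.width} feature map")
```

Under these assumptions the backbone supplies only the stride-16 resolution, so a side network that builds the remaining levels from intermediate activations is what gives detection and segmentation heads the multi-scale inputs they usually expect.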
