Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

Ting Liu,Xuyang Liu,Siteng Huang,Liangtao Shi,Zunnan Xu,Yi Xin,Quanjun Yin,Xiaohong Liu
2024-08-29
Abstract:Parameter-efficient fine-tuning (PEFT) has emerged as a popular solution for adapting pre-trained Vision Transformer (ViT) models to downstream applications. While current PEFT methods have achieved parameter efficiency, they overlook the efficiency of computation and GPU memory during both fine-tuning and inference, falling short of practical requirements. In this paper, we propose \textbf{Sparse-Tuning}, a novel PEFT method that accounts for the information redundancy in images and videos to boost the above efficiency. By sparsely preserving the semantic-relevant tokens and merging irrelevant ones, Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead. To align our token sparsification strategy suitably with fine-tuning purposes, we further design Dense Adapters that establish dense connections from shallow layers to deeper layers. These Dense Adapters integrate multi-level local features to enrich the current tokens, improving both token preservation and model adaptation. Empirical results on VTAB-1K, three image datasets, and two video datasets show that our Sparse-Tuning reduces GFLOPs to \textbf{62\%-70\%} of the original ViT-B while achieving state-of-the-art performance. Source code is available at \url{<a class="link-external link-https" href="https://github.com/liuting20/Sparse-Tuning" rel="external noopener nofollow">this https URL</a>}.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?