Abstract: Vision transformers (ViTs) have recently surged in popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity without sacrificing achievable accuracy. We carry out a first-of-its-kind comprehensive exploration of a unified approach that integrates sparsity into ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach extends seamlessly from unstructured to structured sparsity, the latter by guiding the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector that adaptively determines the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals, which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes improve ViT accuracy rather than compromise it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28% while saving 49.32% of FLOPs and 4.40% of running time. Our code is available at https://github.com/VITA-Group/SViTE.
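
To make the prune-and-grow idea concrete, the following is a minimal sketch of one connectivity update on a single flattened layer that keeps the number of active weights fixed. The abstract does not state the selection criteria, so the ones used here are assumptions for illustration: the smallest-magnitude active weights are dropped, and the same number of inactive positions with the largest gradient magnitude are regrown from zero. The function name `prune_and_grow` and the parameter `update_fraction` are hypothetical and are not taken from the SViTE code base.

```python
def prune_and_grow(weights, mask, grads, update_fraction=0.1):
    """One connectivity update under a fixed budget of active weights.

    Assumed criteria (not specified in the abstract): prune by smallest
    weight magnitude, grow by largest gradient magnitude.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    # Number of connections to swap; cannot exceed the free positions.
    k = min(int(update_fraction * len(active)), len(inactive))

    # Prune: the k active weights with the smallest magnitude.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Grow: the k previously inactive positions with the largest |gradient|.
    grow = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]

    new_mask = list(mask)
    # Masked-out positions are kept at zero.
    new_weights = [w if m else 0.0 for w, m in zip(weights, mask)]
    for i in drop:
        new_mask[i] = False
        new_weights[i] = 0.0
    for i in grow:
        new_mask[i] = True
        new_weights[i] = 0.0  # regrown connections start from zero
    return new_weights, new_mask


# Toy layer with 8 positions, 5 of them active.
weights = [0.9, -0.05, 0.4, 0.0, 0.0, -0.7, 0.0, 0.02]
mask = [True, True, True, False, False, True, False, True]
grads = [0.1, 0.2, -0.3, 0.8, -0.05, 0.0, -0.6, 0.4]

new_weights, new_mask = prune_and_grow(weights, mask, grads, update_fraction=0.4)
print(sum(mask), sum(new_mask))  # the active-weight budget is unchanged
```

In this toy run two connections are swapped: positions 7 and 1 are pruned and positions 3 and 6 are grown, so the sparsity level stays the same while the connectivity changes. A structured variant would apply the same swap to whole attention heads rather than individual weights, with a per-head score in place of the per-weight criteria assumed here.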

Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

FacT: Factor-Tuning for Lightweight Adaptation on Vision Transformer

Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation

Expanding Sparse Tuning for Low Memory Usage

Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning

Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Visual Fourier Prompt Tuning

Effective and Efficient Few-shot Fine-tuning for Vision Transformers

GhostViT: Expediting Vision Transformers Via Cheap Operations

Enhancing Parameter-Efficient Fine-Tuning of Vision Transformers through Frequency-Based Adaptation

SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction

Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

Sparse then Prune: Toward Efficient Vision Transformers

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

An Empirical Study of Parameter Efficient Fine-tuning on Vision-Language Pre-train Model

Low-rank Attention Side-Tuning for Parameter-Efficient Fine-Tuning

SPViT: Enabling Faster Vision Transformers Via Latency-Aware Soft Token Pruning