Abstract: Vision transformers (ViTs) have recently gained explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration of a unified approach to integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach extends seamlessly from unstructured to structured sparsity, the latter by guiding the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector that adaptively determines the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals, which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes improve ViT accuracy rather than compromise it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture) improves top-1 accuracy by 0.28% while enjoying 49.32% FLOPs and 4.40% running time savings. Our code is available at https://github.com/VITA-Group/SViTE.
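To make the "prune-and-grow under a fixed parameter budget" idea concrete, the following is a minimal sketch of one connectivity update on a flat weight vector. It is not the authors' implementation: the function name `prune_and_grow`, the magnitude-based pruning criterion, the gradient-based growth criterion, and the zero initialization of regrown weights are all assumptions made for illustration, since the abstract does not specify them.

```python
def prune_and_grow(weights, mask, grads, k):
    """One hypothetical connectivity update: drop k active weights, regrow k others.

    The number of active connections (the parameter budget) stays constant.
    """
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    k = min(k, len(active), len(inactive))

    # Prune (assumed criterion): the k active weights with the smallest magnitude.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Grow (assumed criterion): the k inactive positions with the largest
    # gradient magnitude. Positions pruned in this step are not candidates.
    grow = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]

    new_weights = list(weights)
    new_mask = list(mask)
    for i in drop:
        new_mask[i] = False
        new_weights[i] = 0.0
    for i in grow:
        new_mask[i] = True
        new_weights[i] = 0.0  # assumption: regrown connections start from zero
    return new_weights, new_mask


# Toy example: 6 weights, 3 of them active (50% sparsity).
weights = [0.9, 0.0, -0.05, 0.0, 0.4, 0.0]
mask = [True, False, True, False, True, False]
grads = [0.1, 0.7, 0.2, -0.01, 0.3, -0.9]

new_weights, new_mask = prune_and_grow(weights, mask, grads, k=1)
print(new_mask)
```

In this toy run the smallest active weight (index 2) is dropped and the inactive position with the largest gradient magnitude (index 5) is activated, so the count of active connections is unchanged. In a real training loop such an update would be applied periodically between ordinary gradient steps.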

SparseMAE: Sparse Training Meets Masked Autoencoders

Masked autoencoders are effective solution to transformer data-hungry

Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

Masked Autoencoders Are Scalable Vision Learners

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Teaching Masked Autoencoder With Strong Augmentations

Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning

Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference

Bootstrapped Masked Autoencoders for Vision BERT Pretraining

Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization

AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

A Closer Look at Self-Supervised Lightweight Vision Transformers

SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners

Improving Visual Representations of Masked Autoencoders With Artifacts Suppression