Abstract: Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that improves performance without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs in various visual tasks, one may ask whether a training scheme specifically for ViTs exists that can also improve performance without increasing inference cost. Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we use MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer both questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and we perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging its experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis to show why and how the scheme works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. In addition, our training scheme can be applied to improve performance when fine-tuning ViTs. Last but not least, the proposed EWA technique can significantly improve the effectiveness of naive MoE on various small 2D visual datasets and on 3D visual tasks.
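The three steps named in the abstract (random uniform token partition, EWA after each iteration, and merging experts into a single FFN) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The abstract does not give the exact EWA formula, so the sketch assumes that EWA moves each expert toward the mean of all experts, controlled by a hypothetical share rate `beta`. The function names and the representation of each expert as a flat list of weights are likewise assumptions made for illustration.

```python
import random


def random_uniform_partition(num_tokens, num_experts, seed=None):
    """Shuffle token indices and deal them out so expert sizes differ by at most one."""
    indices = list(range(num_tokens))
    random.Random(seed).shuffle(indices)
    return [indices[e::num_experts] for e in range(num_experts)]


def merge_experts(experts):
    """Average the experts element-wise into one weight vector (the inference-time FFN)."""
    count = len(experts)
    return [sum(column) / count for column in zip(*experts)]


def ewa_step(experts, beta):
    """Assumed EWA form: mix each expert with the mean of all experts.

    beta = 0 leaves the experts unchanged; beta = 1 makes them all equal to the mean.
    """
    mean = merge_experts(experts)
    return [
        [(1.0 - beta) * w + beta * m for w, m in zip(expert, mean)]
        for expert in experts
    ]


# Hypothetical toy setup: 4 experts with 2 weights each, 16 tokens.
experts = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0], [7.0, 14.0]]
groups = random_uniform_partition(num_tokens=16, num_experts=4, seed=0)
experts = ewa_step(experts, beta=0.5)   # would run at the end of each training iteration
ffn_weights = merge_experts(experts)    # would run once, after training
```

Under this assumed form, an EWA step leaves the mean of the experts unchanged, so the merged FFN depends only on the gradient updates made between averaging steps and not on how often averaging is applied.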

Effective Vision Transformer Training: A Data-Centric Perspective

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Training Vision Transformers with only 2040 Images

Optimizing Vision Transformers with Data-Free Knowledge Transfer

Super Vision Transformer

Automated Progressive Learning for Efficient Training of Vision Transformers

Auto-scaling Vision Transformers without Training

Denoising Vision Transformers

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Improve Vision Transformers Training by Suppressing Over-smoothing

Towards Efficient Adversarial Training on Vision Transformers

Improving Vision Transformers by Revisiting High-Frequency Components

DeepViT: Towards Deeper Vision Transformer

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Vision Transformers with Patch Diversification

Improving Vision Transformers for Incremental Learning

Budgeted Training for Vision Transformer

Experts Weights Averaging: A New General Training Scheme for Vision Transformers

A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking

A General and Efficient Training for Transformer via Token Expansion