Abstract: Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that improves performance without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs on various visual tasks, one may ask whether there is a training scheme specifically for ViTs that also improves performance without increasing inference cost. Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Since an MoE can also be viewed as a multi-branch structure, can we use MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer both questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and we perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging its experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis to show why and how the scheme works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. In addition, our training scheme can be applied to improve performance when fine-tuning ViTs. Last but not least, the proposed EWA technique can significantly improve the effectiveness of naive MoE on various small 2D visual datasets and on 3D visual tasks.
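The three mechanisms named in the abstract (random uniform token partition, EWA after each iteration, and averaging experts into a single FFN after training) can be illustrated with a minimal pure-Python sketch. The abstract does not state the EWA update rule, so the sketch assumes a simple convex blend of each expert with the mean of the other experts, controlled by a hypothetical share rate `beta`. Experts are represented as flat lists of weights, and all function names are illustrative rather than taken from the paper.

```python
import random


def random_uniform_partition(num_tokens, num_experts, rng):
    """Shuffle token indices and split them into near-equal groups, one per expert."""
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    # Expert e takes every num_experts-th shuffled index, so group sizes differ by at most 1.
    return [indices[e::num_experts] for e in range(num_experts)]


def experts_weights_averaging(experts, beta):
    """Blend each expert with the mean of the other experts.

    ASSUMPTION: this update rule and the share rate `beta` are illustrative;
    the abstract only says that the experts' weights are averaged each iteration.
    """
    n = len(experts)
    dim = len(experts[0])
    totals = [sum(w[k] for w in experts) for k in range(dim)]
    blended = []
    for w in experts:
        others_mean = [(totals[k] - w[k]) / (n - 1) for k in range(dim)]
        blended.append([beta * w[k] + (1 - beta) * others_mean[k] for k in range(dim)])
    return blended


def merge_experts(experts):
    """Average all experts into one weight vector, i.e. a single FFN for inference."""
    n = len(experts)
    dim = len(experts[0])
    return [sum(w[k] for w in experts) / n for k in range(dim)]


# Usage: 8 tokens routed to 4 toy experts, one EWA step, then the final merge.
rng = random.Random(0)
groups = random_uniform_partition(num_tokens=8, num_experts=4, rng=rng)
experts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
experts = experts_weights_averaging(experts, beta=0.9)
ffn_weights = merge_experts(experts)
```

Under this assumed rule, one EWA step pulls the experts toward each other while leaving their overall mean unchanged, so the merged FFN weights depend only on the experts' average. Routing by random partition needs no learned gate, which fits the abstract's description of the training-time MoE as more efficient.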

Experts Weights Averaging: A New General Training Scheme for Vision Transformers
