MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai

2024-06-07

Abstract: The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty of training mixture-of-experts (MoE) models, in particular the lack of pre-trained MoE checkpoints. Because MoE models are sparsely activated, they can improve both quality and computational efficiency, but training them from scratch requires large amounts of data and compute. Public repositories such as timm mainly provide pre-trained weights for dense models and offer no comparable resources for MoE models. To close this gap, the paper proposes MoE Jetpack, a method for fine-tuning a dense checkpoint into an MoE model. MoE Jetpack consists of two key techniques: checkpoint recycling, which uses the pre-trained weights of a dense model as the initial weights of the MoE model to speed up convergence and improve accuracy; and the hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture so that it integrates the dense checkpoint better and further improves fine-tuning performance. Experiments on vision tasks show that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. In this way, MoE models inherit pre-trained knowledge without the computational burden of MoE pre-training, which lowers the barrier to their adoption.
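The idea behind checkpoint recycling can be illustrated with a minimal sketch. It assumes each expert is a smaller feed-forward network whose weights are copied from a subset of the hidden units of a dense feed-forward layer. The function name `recycle_dense_ffn`, the uniform random selection rule, and the toy shapes are hypothetical choices made for illustration; the abstract does not specify how the paper selects or assigns dense weights to experts.

```python
import random


def recycle_dense_ffn(w_in, w_out, num_experts, expert_hidden, seed=0):
    """Initialize experts by copying subsets of a dense FFN's hidden units.

    w_in:  list of `hidden` rows, each of length d_model (first linear layer)
    w_out: list of d_model rows, each of length `hidden` (second linear layer)
    """
    rng = random.Random(seed)
    hidden = len(w_in)
    experts = []
    for _ in range(num_experts):
        # Assumption: uniform sampling without replacement. This is a
        # placeholder rule, not necessarily the selection used in the paper.
        idx = sorted(rng.sample(range(hidden), expert_hidden))
        experts.append({
            "hidden_units": idx,
            # Rows of the first layer that produce the chosen hidden units.
            "w_in": [list(w_in[i]) for i in idx],
            # Columns of the second layer that consume the chosen hidden units.
            "w_out": [[row[i] for i in idx] for row in w_out],
        })
    return experts


# Toy dense FFN standing in for a pre-trained checkpoint (d_model=4, hidden=8).
d_model, hidden = 4, 8
_rng = random.Random(1)
dense_w_in = [[_rng.gauss(0, 1) for _ in range(d_model)] for _ in range(hidden)]
dense_w_out = [[_rng.gauss(0, 1) for _ in range(hidden)] for _ in range(d_model)]

experts = recycle_dense_ffn(dense_w_in, dense_w_out, num_experts=4, expert_hidden=3)
```

Each expert starts from weights the dense model has already learned instead of a random initialization, which is the property the summary credits for faster convergence during fine-tuning.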

MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts

MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism

AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference

MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training

MoEtion: Efficient and Reliable Checkpointing for Mixture-of-Experts Models at Scale

Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training

Residual Mixture of Experts

MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

FASTERMOE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems

Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

MoEC: Mixture of Expert Clusters

ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts

APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes

ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling

Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

MoE-RBench: Towards Building Reliable Language Models with Sparse Mixture-of-Experts