MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks

Xingkui Zhu, Yiran Guan, Dingkang Liang, Yuchao Chen, Yuliang Liu, Xiang Bai
2024-06-07
Abstract: The sparsely activated mixture of experts (MoE) model presents a promising alternative to traditional densely activated (dense) models, enhancing both quality and computational efficiency. However, training MoE models from scratch demands extensive data and computational resources. Moreover, public repositories like timm mainly provide pre-trained dense checkpoints, lacking similar resources for MoE models, hindering their adoption. To bridge this gap, we introduce MoE Jetpack, an effective method for fine-tuning dense checkpoints into MoE models. MoE Jetpack incorporates two key techniques: (1) checkpoint recycling, which repurposes dense checkpoints as initial weights for MoE models, thereby accelerating convergence, enhancing accuracy, and alleviating the computational burden of pre-training; (2) hyperspherical adaptive MoE (SpheroMoE) layer, which optimizes the MoE architecture for better integration of dense checkpoints, enhancing fine-tuning performance. Our experiments on vision tasks demonstrate that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. Our code will be publicly available at https://github.com/Adlith/MoE-Jetpack.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty of training mixture of experts (MoE) models, in particular the lack of pre-trained MoE checkpoints. Because MoE models are sparsely activated, they can improve both quality and computational efficiency, but training them from scratch requires large amounts of data and compute. Public repositories such as timm mainly provide pre-trained weights for dense models and offer no comparable resources for MoE models. To close this gap, the paper proposes MoE Jetpack, a method for fine-tuning dense checkpoints into MoE models. MoE Jetpack combines two techniques: checkpoint recycling, which uses the pre-trained weights of a dense model as the initial weights of the MoE model to speed up convergence and improve accuracy; and the hyperspherical adaptive MoE (SpheroMoE) layer, which adapts the MoE architecture so that dense checkpoints integrate better and fine-tuning performance improves further. Experiments on vision tasks show that MoE Jetpack significantly improves convergence speed and accuracy when fine-tuning dense checkpoints into MoE models. As a result, MoE models can inherit pre-trained knowledge and need less pre-training compute, which lowers the barrier to adopting them.
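To make the idea of using dense weights as the starting point for an MoE model concrete, here is a minimal sketch of the simplest possible variant: every expert starts as a plain copy of the dense feed-forward block. This is an illustration under stated assumptions, not the paper's checkpoint recycling procedure, whose details are not given in this summary, and it does not model the SpheroMoE layer. The function names (`recycle_dense_ffn`, `moe_forward`), the tiny layer sizes, and the top-2 softmax router are all hypothetical choices made for the example.

```python
import copy
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def ffn(params, x):
    """Two-layer feed-forward block with ReLU: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(params["w1"], x)]
    return matvec(params["w2"], hidden)


def recycle_dense_ffn(dense_params, num_experts):
    """Assumed copy-based initialisation: each expert is an independent copy of the dense FFN."""
    return [copy.deepcopy(dense_params) for _ in range(num_experts)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(experts, router_w, x, top_k=2):
    """Send x to the top_k experts and mix their outputs with renormalised gate weights."""
    gates = softmax(matvec(router_w, x))
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in chosen)
    out = [0.0] * len(experts[0]["w2"])
    for i in chosen:
        y = ffn(experts[i], x)
        out = [o + (gates[i] / norm) * yi for o, yi in zip(out, y)]
    return out


# Toy "dense checkpoint" with random weights (stands in for real pre-trained weights).
rng = random.Random(0)
dim, hidden_dim, num_experts = 4, 8, 4
dense = {
    "w1": [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hidden_dim)],
    "w2": [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(dim)],
}

experts = recycle_dense_ffn(dense, num_experts)
router_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
x = [rng.gauss(0, 1) for _ in range(dim)]

dense_out = ffn(dense, x)
moe_out = moe_forward(experts, router_w, x)

# Largest element-wise gap between the dense output and the freshly initialised MoE output.
max_diff = max(abs(a - b) for a, b in zip(dense_out, moe_out))
```

Because all experts are identical at initialisation and the selected gate weights are renormalised to sum to one, the MoE layer in this sketch reproduces the dense layer's output up to floating-point error, whatever the router picks. Fine-tuning then lets the experts diverge from this shared starting point.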