AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

Bo-Wen Zhang, Liangdong Wang, Ye Yuan, Jijie Li, Shuhao Gu, Mengdi Zhao, Xinya Wu, Guang Liu, Chengwei Wu, Hanyu Zhao, Li Du, Yiming Ju, Quanyue Ma, Yulong Ao, Yingli Zhao, Songhe Zhu, Zhou Cao, Dong Liang, Yonghua Lin, Ming Zhang, Shunfei Wang, Yanxin Zhou, Min Ye, Xuekai Chen, Xinyang Yu, Xiangjun Huang, Jian Yang
2024-08-13
Abstract: In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch will cost a lot of computation resources while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper primarily addresses three key issues in large-scale language model training:

1. **Resource Consumption**: As large language models are applied in more fields, model size keeps increasing, and the resources required for pre-training grow exponentially. Training a large language model from scratch requires a significant amount of computational resources, whereas expanding from a smaller model is more efficient.
2. **Data Requirements and Computational Costs**: Training large-scale models, including Mixture of Experts (MoE) architectures, is resource-intensive in data collection and processing and carries high time and computational costs. Traditional training methods require a large amount of data, which is time-consuming and demands high hardware specifications, putting such training out of reach for resource-limited institutions.
3. **Training Efficiency and Performance Optimization**: Training large-scale models from scratch can take weeks or even months, delaying experimentation and iteration. In addition, improper initialization or inefficient training strategies can lead to poor model performance and wasted resources.

To address these issues, the paper proposes AquilaMoE, a bilingual 8*16B Mixture of Experts (MoE) language model built with a training method called EfficientScale. EfficientScale optimizes model performance and minimizes data requirements through a two-stage process:

- **Scale-Up Stage**: Uses the weights of a smaller pre-trained model to initialize a larger model, transferring knowledge and allowing continuous pre-training with significantly less data.
- **Scale-Out Stage**: Uses the pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance.

Through these methods, the researchers successfully trained a model with 16B parameters, followed by training the 8*16B parameter AquilaMoE model, achieving significant improvements in both model performance and training efficiency.
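
The two stages can be illustrated with a toy sketch. The summary above does not say which initialization scheme the paper selected, so the code below makes its own assumptions: Scale-Up is shown as a function-preserving width expansion in the style of Net2WiderNet (duplicate hidden units and split their outgoing weights), and Scale-Out is shown as copying one dense feed-forward block into every expert behind a zero-initialized top-2 router. All function names, shapes, and the router initialization are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn(params, x):
    """Toy dense feed-forward block: W2 @ relu(W1 @ x)."""
    W1, W2 = params
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def scale_up_width(params, new_hidden, rng):
    """Assumed Scale-Up step: widen the hidden layer without changing the
    function. Extra units are copies of existing ones, and each unit's
    outgoing weights are divided by its number of copies."""
    W1, W2 = params
    old_hidden = len(W1)
    extra = [rng.randrange(old_hidden) for _ in range(new_hidden - old_hidden)]
    src = list(range(old_hidden)) + extra      # source unit for each new unit
    counts = [src.count(j) for j in range(old_hidden)]
    new_W1 = [list(W1[j]) for j in src]
    new_W2 = [[row[j] / counts[j] for j in src] for row in W2]
    return new_W1, new_W2

def scale_out(params, num_experts, d_in):
    """Assumed Scale-Out step: every expert starts as a copy of the dense
    block; the router starts at zero (an illustrative choice)."""
    W1, W2 = params
    experts = [([list(r) for r in W1], [list(r) for r in W2])
               for _ in range(num_experts)]
    router = [[0.0] * d_in for _ in range(num_experts)]
    return experts, router

def moe(experts, router, x, top_k=2):
    """Top-k MoE layer with softmax gate weights over the selected experts."""
    logits = matvec(router, x)
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in top)
    gates = [math.exp(logits[i] - peak) for i in top]
    total = sum(gates)
    out = [0.0] * len(experts[0][1])
    for i, g in zip(top, gates):
        y = ffn(experts[i], x)
        out = [o + (g / total) * yi for o, yi in zip(out, y)]
    return out

rng = random.Random(0)
d_in, hidden, d_out = 4, 3, 2
small = ([[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(hidden)],
         [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d_out)])
x = [rng.uniform(-1, 1) for _ in range(d_in)]

large = scale_up_width(small, new_hidden=6, rng=rng)          # Scale-Up
experts, router = scale_out(large, num_experts=8, d_in=d_in)  # Scale-Out

y_small, y_large, y_moe = ffn(small, x), ffn(large, x), moe(experts, router, x)
```

In this sketch all three outputs agree up to floating-point error, so each larger model starts from the function its predecessor already computes and continued pre-training does not start from a random initialization. The experts are identical only at initialization; they would diverge once the router and expert weights are trained.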