Preparing Lessons for Progressive Training on Language Models

Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang, Lifeng Shang, Xin Jiang, Qun Liu
2024-02-10
Abstract: The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prep**a**res lessons for ex**p**anding **o**perations by **l**earning high-**l**ayer functi**o**nality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the resource consumption and environmental impact of training large language models, and proposes a new method, Apollo, to make training from scratch more efficient. As Transformer models grow, so do the compute and greenhouse gas emissions required to train them. A common remedy is to initialize a larger model from a pretrained smaller one, but this may not be suitable for newly designed architectures. Training from scratch is slow, and progressively stacking layers often fails to deliver significant acceleration. Apollo instead lets the lower layers learn higher-layer functionality early in training, so that the model's depth can later be extended effectively.

The core ideas of Apollo are:

1. **Low-Value-Prioritized Sampling (LVPS)**: at each training step, the depth to train is sampled at random, with lower values (shallower depths) prioritized, so that the model is trained at different depths.
2. **Weight Sharing**: the weights of the lower layers are shared across the sampled depths, so they adapt to the needs of different depths and learn higher-layer functionality in advance.
3. **Interpolation Method**: the model's depth is extended by interpolation, which is more stable and avoids the gradient issues that may arise from directly stacking layers.

With these techniques, Apollo improves training efficiency while reducing time and financial costs. Experimental results show a best acceleration ratio of 41.6% in terms of FLOPs savings, rivaling even methods that use pretrained models.
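
The sampling and weight-sharing ideas can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the exact LVPS distribution, so the sketch assumes a probability inversely proportional to depth, and the names `lvps_probabilities`, `sample_depth`, and `build_shared_stack` are hypothetical. Layers are represented by plain strings so that the reuse pattern is easy to see.

```python
import random

def lvps_probabilities(candidate_depths):
    """Assumed LVPS distribution: probability proportional to 1/depth,
    so lower (cheaper) depths are sampled more often."""
    inverse = [1.0 / d for d in candidate_depths]
    total = sum(inverse)
    return [w / total for w in inverse]

def sample_depth(candidate_depths, rng):
    """Draw one depth for the current training step."""
    probs = lvps_probabilities(candidate_depths)
    return rng.choices(candidate_depths, weights=probs, k=1)[0]

def build_shared_stack(trained_layers, depth):
    """Weight sharing (assumed mapping): a network of `depth` layers is
    built by reusing the same trained layer objects, so no new
    parameters are created for the deeper network."""
    n = len(trained_layers)
    return [trained_layers[i * n // depth] for i in range(depth)]

# Hypothetical setup: 3 trained layers, candidate depths 3, 6 and 12.
rng = random.Random(0)
candidates = [3, 6, 12]
layers = ["layer0", "layer1", "layer2"]

depth = sample_depth(candidates, rng)
stack = build_shared_stack(layers, depth)
print(depth, stack)
```

Under the assumed 1/depth weighting, the three candidate depths are drawn with probabilities 4/7, 2/7, and 1/7, so most steps run the cheapest network while the shared layers still see deeper configurations occasionally.
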
The paper also compares layer extension methods, namely stacking and interpolation, and chooses interpolation as Apollo's main extension strategy because it is more stable. In summary, Apollo extends model depth efficiently by preparing a "curriculum" in which the lower layers learn higher-layer functionality in advance, which accelerates training and reduces resource consumption.
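
To make the difference between the two extension strategies concrete, here is a small sketch of one common reading of them: stacking repeats the whole block of trained layers on top of itself, while interpolation places each new layer next to the layer it is copied from. The index formulas and the function names (`stack_mapping`, `interpolation_mapping`, `expand_layers`) are assumptions for illustration, not taken from the paper.

```python
import copy

def stack_mapping(num_src, num_tgt):
    """Stacking (assumed form): target layer i copies source layer i mod num_src."""
    return [i % num_src for i in range(num_tgt)]

def interpolation_mapping(num_src, num_tgt):
    """Interpolation (assumed form): target layers are spread evenly over
    the source layers, so copies of a layer stay adjacent to it."""
    return [i * num_src // num_tgt for i in range(num_tgt)]

def expand_layers(layers, mapping):
    """Build the deeper model by copying the source layer chosen for each position."""
    return [copy.deepcopy(layers[j]) for j in mapping]

# Hypothetical example: grow a 3-layer model to 6 layers.
src = ["layer0", "layer1", "layer2"]
print(expand_layers(src, stack_mapping(3, 6)))
print(expand_layers(src, interpolation_mapping(3, 6)))
```

With stacking, the output of the last trained layer is fed straight into a copy of the first one. With interpolation, each layer is followed by a copy of itself, which keeps neighbouring layers functionally similar.
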