Abstract: The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions due to growing model sizes. Prior work suggests using pretrained small models to improve training efficiency, but this approach may not be suitable for new model structures. On the other hand, training from scratch can be slow, and progressively stacking layers often fails to achieve significant acceleration. To address these challenges, we propose a novel method called Apollo, which prep**a**res lessons for ex**p**anding **o**perations by **l**earning high-**l**ayer functi**o**nality during training of low layers. Our approach involves low-value-prioritized sampling (LVPS) to train different depths and weight sharing to facilitate efficient expansion. We also introduce an interpolation method for stable model depth extension. Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models, making it a universal and efficient solution for training deep models while reducing time, financial, and environmental costs.

What problem does this paper attempt to address?

The paper addresses the resource consumption and environmental impact of training large language models and proposes a new method, Apollo, to make training language models from scratch more efficient. As Transformer models grow, so do the resources consumed and the greenhouse gases emitted during training. Prior work responds by initializing larger models from pretrained smaller ones, which improves training efficiency. However, that approach may not suit newly designed model architectures, training from scratch can be slow, and progressively stacking layers often fails to deliver significant acceleration. Apollo therefore has the low layers learn high-layer functionality during the early stages of training, so that the model's depth can later be extended efficiently.

The core ideas of Apollo are:

1. **Low-Value-Prioritized Sampling (LVPS)**: A sampling strategy that selects the depth to train at each training step, prioritizing low values, so that training covers several depths.
2. **Weight Sharing**: The weights of the low layers are shared across the sampled depths, so they adapt to the needs of different depths and learn high-layer functionality in advance.
3. **Interpolation Method**: An interpolation method extends the model's depth stably, avoiding gradient issues that may arise from directly stacking layers.

With these techniques, Apollo improves training efficiency while reducing time and financial costs. Experimental results show that Apollo's best acceleration ratio is 41.6% in terms of FLOPs savings, rivaling even methods that use pretrained models.
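The LVPS idea described above can be illustrated with a minimal sketch. The summary only states that low values are prioritized; the exact distribution is not given here, so the `1 / depth` weighting, the function names, and the candidate depths below are all assumptions made for illustration, not the paper's implementation.

```python
import random
from collections import Counter


def lvps_probabilities(depths):
    """Return one probability per candidate depth, larger for smaller depths.

    Assumption: each depth is weighted by 1 / depth. The source only says
    that low values are prioritized, not which distribution is used.
    """
    inverse = [1.0 / d for d in depths]
    total = sum(inverse)
    return [w / total for w in inverse]


def sample_depth(depths, rng):
    """Draw the depth to train at the current step."""
    return rng.choices(depths, weights=lvps_probabilities(depths), k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    candidate_depths = [3, 6, 9, 12]  # hypothetical candidate depths
    counts = Counter(sample_depth(candidate_depths, rng) for _ in range(10_000))
    for depth in candidate_depths:
        print(depth, counts[depth])
```

Under this assumed weighting, depth 3 is drawn with probability 0.48 and depth 12 with probability 0.12, so most training steps run at a shallow, cheap depth while deeper configurations are still visited occasionally.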
Additionally, the paper compares layer extension methods such as stacking and interpolation, and adopts interpolation as Apollo's main extension strategy because it is more stable. In summary, Apollo prepares "lessons" that let the low layers learn high-layer functionality in advance, so that model depth can be extended efficiently, which accelerates training and reduces resource consumption.
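To make the difference between the two extension methods concrete, here is a small sketch. It assumes the usual meaning of the terms in the model-growth literature: stacking appends whole copies of the existing layer sequence, while interpolation places each copy next to the layer it was copied from. The summary does not spell out these definitions, so treat the function names and the ordering as assumptions.

```python
def stack_layers(layers, factor):
    """Stacking (assumed definition): append whole copies of the layer list.

    ["a", "b"] with factor 2 becomes ["a", "b", "a", "b"].
    """
    return [layer for _ in range(factor) for layer in layers]


def interpolate_layers(layers, factor):
    """Interpolation (assumed definition): repeat each layer in place.

    ["a", "b"] with factor 2 becomes ["a", "a", "b", "b"].
    """
    return [layer for layer in layers for _ in range(factor)]


if __name__ == "__main__":
    trained = ["layer0", "layer1", "layer2"]  # placeholders for trained layers
    print(stack_layers(trained, 2))
    print(interpolate_layers(trained, 2))
```

The strings stand in for trained layers; in a real model each entry of the expanded list would be an independent copy of the corresponding layer's parameters. With interpolation, every new layer sits beside the layer it was initialized from, which keeps the original ordering of low and high layers intact.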

Preparing Lessons for Progressive Training on Language Models


