Abstract: The pre-training phase of language models often begins with randomly initialized parameters. With the current trends in scaling models, training their large number of parameters can be extremely slow and costly. In contrast, small language models are less expensive to train, but they often cannot achieve the accuracy of large models. In this paper, we explore an intriguing idea to connect these two different regimes: Can we develop a method to initialize large language models using smaller pre-trained models? Will such initialization bring any benefits in terms of training time and final accuracy? In this paper, we introduce HyperCloning, a method that can expand the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model already inherits the predictive power and accuracy of the smaller model before the training starts. We demonstrate that training such an initialized model results in significant savings in terms of GPU hours required for pre-training large language models.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the issues of high cost and time consumption during the pre-training of Large Language Models (LLMs). Specifically:

1. **Pre-training Cost Issue**: Current large language models typically start training from randomly initialized parameters, which is not only slow but also costly. For example, training a model with 12 billion parameters requires approximately 72,000 GPU hours, which is very expensive at public cloud computing resource prices.
2. **Trade-off Between Small and Large Models**: While small language models have lower training costs and less environmental impact, their accuracy often cannot reach the level of large models. Therefore, to pursue high performance, the industry tends to use larger models, but the training costs of these models are very high.

To solve the above problems, the paper proposes a method called **HyperCloning**, which can expand from a smaller pre-trained model to a larger model, thereby significantly reducing training time and cost while retaining the functionality of the small model. Through this method, large models can inherit the predictive capabilities and accuracy of small models at the early stages of training and further improve performance through training.

### Main Contributions

1. **Function-Preserving Initialization**: Ensures that the large model can accurately replicate the functionality of the small model after initialization.
2. **Accelerated Training**: Experiments show that models initialized with HyperCloning converge faster and achieve higher final accuracy compared to traditional random initialization under the same training budget.
3. **Specific Implementation of Cloning Linear Layers**: Provides a detailed explanation of how to clone linear layers in different scenarios to achieve function preservation.
4. **Model Performance Analysis**: Validates the effectiveness and advantages of HyperCloning by comparing the effects of different expansion strategies.
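
To make the idea of function-preserving expansion of a linear layer concrete, here is a minimal NumPy sketch. It assumes one simple scheme: the weight matrix is tiled into a block grid and scaled so that a duplicated input yields a duplicated output. The helper name `clone_linear` and the `factor` parameter are hypothetical, chosen for illustration. The summary above does not spell out the exact construction used in the paper, so this should be read as an illustration of the property, not as the paper's implementation.

```python
import numpy as np

def clone_linear(weight, bias, factor=2):
    """Expand y = W x + b to a layer whose input and output are `factor` times wider.

    Assumption (illustrative only): the wide layer receives the small input
    repeated `factor` times and should emit the small output repeated
    `factor` times.
    """
    # Tile W into a factor x factor grid of blocks. Each output block then sums
    # W x over `factor` copies of x, so dividing by `factor` restores W x.
    big_weight = np.tile(weight, (factor, factor)) / factor
    # The bias is simply repeated for every output block.
    big_bias = np.tile(bias, factor)
    return big_weight, big_bias

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))   # small layer: 4 inputs -> 3 outputs
b = rng.normal(size=3)
x = rng.normal(size=4)

W_big, b_big = clone_linear(W, b, factor=2)
x_big = np.tile(x, 2)         # duplicated input for the wide layer

y_small = W @ x + b
y_big = W_big @ x_big + b_big

# The wide layer reproduces the small layer's output in each block.
assert np.allclose(y_big, np.tile(y_small, 2))
```

Under this scheme the wide layer computes exactly the same function as the small one at initialization, and the extra parameters only start to differ once training begins.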

Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

Warmstarting for Scaling Language Models

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

Weight subcloning: direct initialization of transformers using larger pretrained ones

Transfer training from smaller language model

Language models scale reliably with over-training and on downstream tasks

Emergent Abilities in Reduced-Scale Generative Language Models

Scaling Laws for Multilingual Language Models

Elixir: Train a Large Language Model on a Small GPU Cluster

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

Efficient Large-Scale Language Model Training on GPU Clusters

Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training

Scaling Laws for Neural Language Models

Need a Small Specialized Language Model? Plan Early!

Scaling Language-Image Pre-training via Masking

Scaling Language Model Size in Cross-Device Federated Learning

Scaling Laws for Pre-training Agents and World Models

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

bert2BERT: Towards Reusable Pretrained Language Models

On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models

Reproducible scaling laws for contrastive language-image learning