A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs

Ankit Singh Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, Vladimir Feinberg, Seungyeon Kim, Hrayr Harutyunyan, Nikunj Saunshi, Zachary Nado, Rakesh Shivanna, Sashank J. Reddi, Aditya Krishna Menon, Rohan Anil, Sanjiv Kumar

2024-10-24

Abstract: A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

This paper addresses the high computational cost of pre-training large language models (LLMs). Specifically, it explores a paradigm in which a small language model (SLM) is used to improve both the efficiency and the quality of LLM pre-training. The SLM plays two roles:

1. **Providing soft labels as additional training supervision**: the SLM's predictive distributions serve as soft labels that give the LLM extra supervision beyond the ground-truth next token.
2. **Selecting valuable training examples**: the SLM helps pick out a small subset of "informative" and "hard" training examples, so that training focuses on the most useful data.

Together, these two mechanisms transfer the SLM's predictive distribution to the LLM while prioritizing specific regions of the training data distribution. Empirically, the paper reports that this reduces LLM training time compared to standard training while also improving overall quality.

The paper also develops a statistical framework to study how SLMs can enable efficient training of high-quality LLMs. The framework characterizes how the SLM's seemingly low-quality supervision can still enhance the training of a much more capable LLM, and it shows that this supervision must be used adaptively, balancing the bias and variance introduced by the SLM-provided soft labels. The approach is demonstrated on the Pile dataset, where a 1.5-billion-parameter SLM assists in pre-training a 2.8-billion-parameter LLM.
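The two SLM roles described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the mixing weight `alpha`, and the choice of "highest SLM loss" as the hardness criterion are all assumptions made for illustration, since the summary does not specify the exact loss or selection rule.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy of pred_probs against a (possibly soft) target."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def mixed_loss(llm_logits, slm_probs, true_token, alpha):
    """Hypothetical per-token loss: blend of one-hot and SLM soft-label terms.

    alpha = 0 recovers standard next-token training; a larger alpha leans
    more on the SLM's soft labels (assumed knob for the bias/variance balance).
    """
    llm_probs = softmax(llm_logits)
    one_hot = [1.0 if i == true_token else 0.0 for i in range(len(llm_logits))]
    hard_term = cross_entropy(one_hot, llm_probs)
    soft_term = cross_entropy(slm_probs, llm_probs)
    return (1.0 - alpha) * hard_term + alpha * soft_term


def select_hard_examples(slm_losses, keep_fraction):
    """Assumed selection rule: keep the examples with the highest SLM loss."""
    k = max(1, int(len(slm_losses) * keep_fraction))
    ranked = sorted(range(len(slm_losses)), key=lambda i: slm_losses[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage: four examples scored by the SLM, keep the harder half.
kept = select_hard_examples([0.1, 2.0, 0.5, 1.5], keep_fraction=0.5)
loss = mixed_loss([2.0, 0.5, -1.0], slm_probs=[0.7, 0.2, 0.1], true_token=0, alpha=0.3)
```

In this sketch, a fixed `alpha` stands in for what the paper describes as adaptive use of SLM supervision; a schedule that changes `alpha` over training would be one way to make it adaptive, but that choice is likewise an assumption here.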

How to Train Data-Efficient LLMs

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

New Solutions on LLM Acceleration, Optimization, and Application

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Exploring Design Choices for Building Language-Specific LLMs

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

LoLCATs: On Low-Rank Linearizing of Large Language Models

Small Language Models Improve Giants by Rewriting Their Outputs

LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement

Sparsity-Accelerated Training for Large Language Models

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models

Learn To be Efficient: Build Structured Sparsity in Large Language Models