Abstract: Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) does so using less training data than finetuning or distillation requires. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with far fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning of the same T5 model struggles to match it even when using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step

What problem does this paper attempt to address?

This paper addresses the difficulty of deploying large language models (LLMs) in real-world applications. LLMs perform well in few-shot and zero-shot settings, but their size and computational cost make them hard to serve in practice. Serving a single 175 billion parameter LLM requires at least 350GB of GPU memory and specialized infrastructure; that figure is consistent with about 2 bytes per parameter at 16-bit precision for the weights alone. Current state-of-the-art LLMs contain over 500 billion parameters, which raises memory and compute demands further and puts them out of reach for most product teams, especially for applications that need low latency.

To sidestep these costs, practitioners typically deploy smaller specialized models, trained in one of two ways. Finetuning updates a pretrained small model with human-labeled downstream data, while distillation trains the same small model on labels generated by a large LLM. Both approaches share a drawback: to reach performance comparable to LLMs, finetuning needs expensive human labels, and distillation needs a large amount of unlabeled data, which is often hard to obtain.

The paper proposes a new mechanism, Distilling step-by-step, that trains small models to outperform LLMs while using less training data. The core idea is to use the rationales (the intermediate reasoning) generated by the LLM as additional supervision when training the small model, alongside the labels, in a multi-task framework. This reduces both the amount of training data required and the size of the model that has to be deployed, which lowers deployment cost.

The paper reports three key results:

1. **Reducing training data**: Compared to standard finetuning and distillation, Distilling step-by-step achieves better performance with fewer labeled or unlabeled training examples.
2. **Reducing model size**: Distilling step-by-step matches or exceeds the performance of few-shot prompted LLMs using much smaller models (up to 2000 times smaller).
3. **Simultaneously reducing data and model size**: Distilling step-by-step outperforms LLMs with both a smaller model and less data; for example, a 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of the available data on one benchmark.

In summary, the paper proposes a method that uses the reasoning produced by LLMs to train small models more data-efficiently, easing the deployment challenges of LLMs in real-world applications.
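The summary above does not spell out how rationales enter training, so the following is only a minimal sketch of one way a multi-task setup of this kind could look: each example yields two training pairs (one predicting the label, one generating the rationale), and the two losses are summed with a weight. The task prefixes `[label]` and `[rationale]`, the `Example` class, both helper functions, and the `rationale_weight` parameter are illustrative assumptions, not taken from the text or from the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    text: str       # task input
    label: str      # human label or LLM-generated label
    rationale: str  # LLM-generated reasoning for the label


def to_multitask_pairs(
    ex: Example,
    label_prefix: str = "[label]",          # assumed prefix, for illustration
    rationale_prefix: str = "[rationale]",  # assumed prefix, for illustration
) -> List[Tuple[str, str]]:
    """Turn one example into two (source, target) pairs, one per task."""
    return [
        (f"{label_prefix} {ex.text}", ex.label),
        (f"{rationale_prefix} {ex.text}", ex.rationale),
    ]


def combined_loss(
    label_loss: float,
    rationale_loss: float,
    rationale_weight: float = 1.0,  # assumed hyperparameter
) -> float:
    """Weighted sum of the label-prediction and rationale-generation losses."""
    return label_loss + rationale_weight * rationale_loss


example = Example(
    text="It rained all night. Is the ground likely to be wet?",
    label="yes",
    rationale="Rain makes the ground wet, and it rained all night.",
)
pairs = to_multitask_pairs(example)
for source, target in pairs:
    print(source, "->", target)
```

Under this sketch the rationale task is only needed during training; at inference time the small model would be queried with the label prefix alone, so generating rationales adds no deployment cost.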

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
