Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
2023-07-06
Abstract: Deploying large language models (LLMs) is challenging because they are memory inefficient and compute-intensive for practical applications. In reaction, researchers train smaller task-specific models by either finetuning with human labels or distilling using LLM-generated labels. However, finetuning and distillation require large amounts of training data to achieve comparable performance to LLMs. We introduce Distilling step-by-step, a new mechanism that (a) trains smaller models that outperform LLMs, and (b) achieves so by leveraging less training data needed by finetuning or distillation. Our method extracts LLM rationales as additional supervision for training small models within a multi-task framework. We present three findings across 4 NLP benchmarks: First, compared to both finetuning and distillation, our mechanism achieves better performance with much fewer labeled/unlabeled training examples. Second, compared to few-shot prompted LLMs, we achieve better performance using substantially smaller model sizes. Third, we reduce both the model size and the amount of data required to outperform LLMs; our finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of available data on a benchmark, whereas standard finetuning the same T5 model struggles to match even by using 100% of the dataset. We release the code at: https://github.com/google-research/distilling-step-by-step
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the difficulty of deploying large language models (LLMs) in real-world applications. LLMs perform well in few-shot and zero-shot settings, but their size and computational cost make them hard to serve in practice. Serving a single 175-billion-parameter LLM requires at least 350GB of GPU memory (about 2 bytes per parameter at 16-bit precision) and specialized infrastructure. Current state-of-the-art LLMs contain over 500 billion parameters, which pushes memory and compute requirements further and puts them out of reach for most product teams, especially for applications that need low latency.

To sidestep these challenges, practitioners typically deploy smaller specialized models, trained in one of two ways. Finetuning updates a pretrained small model on downstream human-labeled data; distillation trains the same small model on labels generated by an LLM. Both approaches have a cost: to reach performance comparable to LLMs, finetuning requires expensive human labels, and distillation requires large amounts of unlabeled data, which can be hard to obtain in practice.

To address this, the paper proposes Distilling step-by-step, a mechanism for training small models that outperform LLMs while using less training data. The core idea is to use the rationales (the natural-language reasoning) generated by the LLM as additional supervision when training the small model within a multi-task framework. This reduces both the amount of training data required and the size of the deployed model, which lowers deployment cost. The paper reports three key results:

1. **Reducing training data**: Compared to standard finetuning and distillation, Distilling step-by-step achieves better performance with fewer labeled or unlabeled training examples.
2. **Reducing model size**: Distilling step-by-step trains models much smaller than LLMs (up to 2000 times smaller) that match or exceed the performance of few-shot prompted LLMs.
3. **Simultaneously reducing data and model size**: Distilling step-by-step reduces the model size and the amount of training data at the same time while still surpassing LLM performance. For example, a finetuned 770M T5 model outperforms the few-shot prompted 540B PaLM model using only 80% of the available data on one benchmark.

In summary, the paper proposes a method that leverages the reasoning capabilities of LLMs to train small models more efficiently, easing the deployment challenges of LLMs in real-world applications.
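
The multi-task use of rationales described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Example` type, the `[label]` and `[rationale]` task prefixes, the helper names, and the `rationale_weight` parameter are all assumptions chosen for illustration. The sketch only shows the general shape of the idea: each training input yields two targets (the label and the LLM rationale), and the two per-task losses are combined into one training objective.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    """One training example (hypothetical container, for illustration only)."""
    text: str        # task input, e.g. a question
    label: str       # human label or LLM-generated label
    rationale: str   # rationale extracted from the LLM


def build_multitask_pairs(ex: Example) -> List[Tuple[str, str]]:
    """Turn one example into two (input, target) pairs for a seq2seq model.

    The task prefixes are an assumption of this sketch; they tell the small
    model whether to predict the label or to generate the rationale.
    """
    return [
        (f"[label] {ex.text}", ex.label),
        (f"[rationale] {ex.text}", ex.rationale),
    ]


def combined_loss(label_loss: float,
                  rationale_loss: float,
                  rationale_weight: float = 1.0) -> float:
    """Weighted sum of the two task losses (the weight is an assumed knob)."""
    return label_loss + rationale_weight * rationale_loss


if __name__ == "__main__":
    ex = Example(
        text="Is a feather heavier than a brick?",
        label="no",
        rationale="A brick has far more mass than a feather.",
    )
    for source, target in build_multitask_pairs(ex):
        print(f"{source!r} -> {target!r}")
    print(combined_loss(label_loss=0.5, rationale_loss=0.25))
```

In this framing the rationale is needed only as a training target. At inference time the small model can be prompted with the label task alone, so generating rationales adds no cost at deployment.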