Pre-training Distillation for Large Language Models: A Design Space Exploration

Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li

2024-10-22

Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD in the field of large language models (LLMs) typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B parameter student LLM, validating the effectiveness of PD. Considering the key impact factors of distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline or online logits. We conduct extensive experiments to explore the design space of pre-training distillation and find better configurations and interesting conclusions, such as larger student LLMs generally benefiting more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practices in pre-training distillation.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses whether and how knowledge distillation (KD) can be applied effectively during the pre-training stage of large language models (LLMs). Specifically, it explores how transferring knowledge from a larger teacher model during pre-training can improve the performance of a smaller student model. Previous work applying KD to LLMs has typically focused on the post-training stage, where the student model learns directly from instructions and the corresponding responses generated by the teacher model. This paper extends KD to the pre-training stage, an approach it calls pre-training distillation (PD), and systematically explores its design space along four aspects: logits processing, loss selection, scaling law, and offline or online logits. Among its findings, larger student models generally benefit more from pre-training distillation, while a larger teacher model does not necessarily guarantee better results. The authors hope these explorations will inform future pre-training distillation practice.
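To make two of the design-space aspects (logits processing and loss selection) concrete, here is a minimal per-token sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the top-k truncation, the temperature, the KL-divergence distillation term, the mixing weight `alpha`, and every function name below are hypothetical choices made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_truncate(probs, k):
    """Assumed logits-processing step: keep the k largest teacher
    probabilities, zero the rest, and renormalize to sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    out = [0.0] * len(probs)
    for i in keep:
        out[i] = probs[i] / mass
    return out

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); terms where p is zero contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def cross_entropy(q, target_index, eps=1e-12):
    """Standard language-modeling loss for one token position."""
    return -math.log(max(q[target_index], eps))

def pd_loss(student_logits, teacher_logits, target_index,
            alpha=0.5, temperature=1.0, k=2):
    """Hypothetical combined loss for one token position:
    (1 - alpha) * LM loss + alpha * distillation loss."""
    lm = cross_entropy(softmax(student_logits), target_index)
    teacher = top_k_truncate(softmax(teacher_logits, temperature), k)
    student = softmax(student_logits, temperature)
    kd = kl_divergence(teacher, student)
    return (1.0 - alpha) * lm + alpha * kd

# Toy vocabulary of four tokens; the ground-truth next token is index 0.
loss = pd_loss([2.0, 1.0, 0.5, -1.0], [3.0, 2.5, 0.0, -2.0], target_index=0)
print(round(loss, 4))
```

Setting `alpha=0` recovers ordinary pre-training on the ground-truth token, and `alpha=1` trains on the teacher distribution alone. Truncating to the top k entries is one way teacher logits could be stored compactly when they are computed offline.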

Direct Preference Knowledge Distillation for Large Language Models

DDK: Distilling Domain Knowledge for Efficient Large Language Models

BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation

Dynamic Knowledge Distillation for Pre-trained Language Models

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

A Survey on Knowledge Distillation of Large Language Models

Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application

MiniLLM: Knowledge Distillation of Large Language Models

LLAVADI: What Matters For Multimodal Large Language Models Distillation

Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation

PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs

Gradient Knowledge Distillation for Pre-trained Language Models

Dual-Space Knowledge Distillation for Large Language Models

Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models

MiniPLM: Knowledge Distillation for Pre-Training Language Models

DistiLLM: Towards Streamlined Distillation for Large Language Models

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Revisiting Knowledge Distillation for Autoregressive Language Models

Knowledge Distillation of Black-Box Large Language Models

Using Advanced LLMs to Enhance Smaller LLMs: An Interpretable Knowledge Distillation Approach