Abstract: In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-$k$ teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.

What problem does this paper attempt to address?

Large language models (LLMs) perform well across natural language processing (NLP) tasks, but their large parameter counts place high demands on computing resources, which limits their deployment. Knowledge distillation (KD) reduces model size while preserving performance by transferring the knowledge of a large teacher model to a small student model. The paper identifies two main problems with existing logits distillation methods:

1. **Noise in the long-tailed distribution**: The logits of fine-tuned LLMs show a more extreme long-tailed distribution than those of vision models, and the "noise" hidden in the long tail degrades distillation performance.
2. **Insufficient utilization of internal ranking information**: Existing methods fail to make effective use of the internal ranking information in the logits.

To address these problems, the paper proposes a new loss function, the Bi-directional Logits Difference (BiLD) loss. It filters out long-tail noise by using only the top-k teacher and student logits, and it exploits the internal ranking information by constructing logits differences. Experiments on 13 datasets show that the BiLD loss, using only the top-8 logits, outperforms supervised fine-tuning (SFT), the vanilla KL loss, and five other distillation methods.
### Formula Summary

- **Logits Representation**:
  \[ z_t = [z_{t1}, z_{t2}, \cdots, z_{tN}] \in \mathbb{R}^{1 \times N} \]
  \[ z_s = [z_{s1}, z_{s2}, \cdots, z_{sN}] \in \mathbb{R}^{1 \times N} \]
- **Probability Conversion**:
  \[ p_t = \frac{\exp(z_t / T)}{\sum_{i=1}^{N} \exp(z_{ti} / T)} \]
  \[ p_s = \frac{\exp(z_s / T)}{\sum_{i=1}^{N} \exp(z_{si} / T)} \]
- **KL Divergence**:
  \[ L_{KL} = D_{KL}(p_t \| p_s) \]
- **BiLD Loss Definition**:
  - Teacher-led logits difference (t-LD loss):
    \[ L_{t\text{-}LD} = D_{KL}(p_{t\text{-}led} \| p_{s\text{-}cor}) \]
  - Student-led logits difference (s-LD loss):
    \[ L_{s\text{-}LD} = D_{KL}(p_{t\text{-}cor} \| p_{s\text{-}led}) \]
  - Total BiLD loss:
    \[ L_{BiLD} = L_{t\text{-}LD} + L_{s\text{-}LD} \]

### Conclusion

By introducing the BiLD loss, this research addresses the long-tail noise problem in LLM logits distillation and makes use of the internal ranking information of the logits, thereby improving how well the student model learns from the teacher. The reported experiments show that the BiLD loss outperforms the compared baselines across the evaluated NLP tasks.
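The formulas above can be illustrated with a minimal sketch for a single token position, written in plain Python. It is not the authors' implementation. All function names are hypothetical, and the summary does not spell out how the logits differences are built, so the sketch assumes they are all pairwise differences among the selected top-k logits, normalized with a temperature-scaled softmax.

```python
import math
from itertools import combinations


def softmax(values, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def top_k_indices(logits, k):
    """Indices of the k largest logits, in descending order of value."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def pairwise_differences(values):
    """Assumed construction: values[i] - values[j] for every pair i < j."""
    return [a - b for a, b in combinations(values, 2)]


def led_loss(leader, follower, k, temperature, teacher_leads):
    """One direction of the loss: the leader picks the top-k positions."""
    idx = top_k_indices(leader, k)
    led_p = softmax(pairwise_differences([leader[i] for i in idx]), temperature)
    cor_p = softmax(pairwise_differences([follower[i] for i in idx]), temperature)
    # The teacher-side distribution is always the first argument of the KL term.
    teacher_p, student_p = (led_p, cor_p) if teacher_leads else (cor_p, led_p)
    return kl_divergence(teacher_p, student_p)


def bild_loss(teacher_logits, student_logits, k=8, temperature=1.0):
    """L_BiLD = L_t-LD + L_s-LD for one token position."""
    t_ld = led_loss(teacher_logits, student_logits, k, temperature, True)
    s_ld = led_loss(student_logits, teacher_logits, k, temperature, False)
    return t_ld + s_ld


teacher = [5.0, 3.0, 1.0, -2.0, -4.0, -6.0]
student = [4.0, 3.5, 0.5, -1.0, -5.0, -7.0]
loss = bild_loss(teacher, student, k=3)
```

Under these assumptions, logits outside the top-k of both models never enter the loss, which is how the long-tail filtering shows up in the sketch. Because only differences are used, adding a constant to every student logit leaves the loss unchanged. A practical version would operate on batched tensors and guard against zero probabilities in the KL term.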

BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
