BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation

Minchong Li, Feng Zhou, Xiaohui Song
2024-09-11
Abstract: In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-$k$ teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the deployment cost of large language models (LLMs). Although LLMs perform well across natural language processing (NLP) tasks, their large parameter counts demand substantial computing resources, which limits wide deployment. Knowledge distillation (KD) reduces model size while preserving performance by transferring the knowledge of a large teacher model to a small student model. However, existing logits distillation methods have two main problems:

1. **Noise in the long-tail distribution**: The logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those of vision models, and the "noise" hidden in the tail degrades distillation performance.
2. **Insufficient use of internal ranking information**: Existing methods fail to effectively utilize the internal ranking information in the logits.

To solve these problems, the paper proposes a new loss function, the Bi-directional Logits Difference (BiLD) loss. It filters out long-tail noise by using only the top-$k$ teacher and student logits, and exploits the internal ranking information by constructing logits differences. In experiments on 13 datasets with two types of LLMs, the BiLD loss using only the top-8 logits outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from the NLP and CV fields.
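The two ideas described above (keep only the top-$k$ logits, then compare logit differences in both directions) can be illustrated with a minimal pure-Python sketch for a single token position. This summary does not specify how the differences are formed, so the pairing scheme used here (all pairs $i < j$ among the selected logits), the helper names, and the default temperature are assumptions for illustration; this is not the authors' implementation.

```python
import math
from itertools import combinations


def softmax(values, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    # q is clamped at eps so an underflowed probability cannot divide by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def top_k_indices(logits, k):
    """Indices of the k largest logits, ordered from largest to smallest."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def pairwise_differences(values):
    """All differences values[i] - values[j] with i < j (assumed pairing scheme)."""
    return [a - b for a, b in combinations(values, 2)]


def led_loss(leader, follower, k, temperature, leader_is_teacher):
    """One direction of the loss: the leader's top-k indices select logits from both models."""
    idx = top_k_indices(leader, k)
    p_led = softmax(pairwise_differences([leader[i] for i in idx]), temperature)
    p_cor = softmax(pairwise_differences([follower[i] for i in idx]), temperature)
    # The teacher-side distribution is always the first argument of the KL term.
    if leader_is_teacher:
        return kl_divergence(p_led, p_cor)
    return kl_divergence(p_cor, p_led)


def bild_loss(teacher_logits, student_logits, k=8, temperature=1.0):
    """Sum of the teacher-led (t-LD) and student-led (s-LD) terms."""
    if k < 2:
        raise ValueError("k must be at least 2 to form a logit difference")
    t_ld = led_loss(teacher_logits, student_logits, k, temperature, True)
    s_ld = led_loss(student_logits, teacher_logits, k, temperature, False)
    return t_ld + s_ld


if __name__ == "__main__":
    teacher = [5.0, 3.0, 1.0, -2.0, -4.0]  # made-up logits for illustration
    student = [4.0, 3.5, 0.5, -1.0, -6.0]
    print(bild_loss(teacher, student, k=3))
```

Because only differences between the selected logits enter the loss, adding the same constant to every student logit leaves the result unchanged, and logits at positions outside both top-$k$ sets never enter the computation.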
### Formula Summary

- **Logits Representation**: teacher logits $z_t$ and student logits $z_s$ over $N$ entries:
  \[ z_t = [z_{t1}, z_{t2}, \cdots, z_{tN}] \in \mathbb{R}^{1\times N} \]
  \[ z_s = [z_{s1}, z_{s2}, \cdots, z_{sN}] \in \mathbb{R}^{1\times N} \]
- **Probability Conversion**: softmax with temperature $T$:
  \[ p_t = \frac{\exp(z_t/T)}{\sum_{i=1}^N \exp(z_{ti}/T)} \]
  \[ p_s = \frac{\exp(z_s/T)}{\sum_{i=1}^N \exp(z_{si}/T)} \]
- **KL Divergence**:
  \[ L_{KL} = D_{KL}(p_t \| p_s) \]
- **BiLD Loss Definition**:
  - Teacher-led logits difference (t-LD) loss:
    \[ L_{t\text{-}LD} = D_{KL}(p_{t\text{-}led} \| p_{s\text{-}cor}) \]
  - Student-led logits difference (s-LD) loss:
    \[ L_{s\text{-}LD} = D_{KL}(p_{t\text{-}cor} \| p_{s\text{-}led}) \]
  - Total BiLD loss:
    \[ L_{BiLD} = L_{t\text{-}LD} + L_{s\text{-}LD} \]

### Conclusion

The BiLD loss addresses the long-tail noise problem in LLM logits distillation and makes use of the internal ranking information in the logits, which improves how well the student model learns from the teacher. The reported experiments support its advantage over the compared methods on a range of NLP tasks.