Abstract: In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily available, and it can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (student) to acquire knowledge from LLMs (teacher) while reducing computational costs. This process introduces challenges, such as potentially noisy pseudo-labels. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD, which enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.

What problem does this paper attempt to address?

The paper addresses the following problem: in natural language processing (NLP) applications, large language models (LLMs) perform very well, but their size and computational requirements limit their deployment in practice, especially when further fine-tuning is required. Smaller models are therefore usually deployed instead, but training them is constrained by the scarcity of labeled data. Unlabeled data is often much easier to obtain, and LLMs can be used to generate pseudo-labels from it to train the smaller models. This process has its own challenges, notably noisy pseudo-labels, so selecting high-quality and informative data is crucial for both model performance and data efficiency.

To address this, the paper proposes LLKD (Learning with Less computational resources and less data for Knowledge Distillation), an adaptive sample selection method that combines signals from the teacher model (an LLM) and the student model (a smaller model). Specifically, LLKD prioritizes samples whose labels the teacher model is highly confident in and for which the student model shows a high information need. This reduces the amount of training data required while achieving superior performance on various datasets.

### Main contributions:

1. **Proposed an adaptive sample selection method**: LLKD selects high-quality and useful data samples for the student model by combining the confidence of the teacher model and the uncertainty of the student model.
2. **Improved data utilization efficiency**: By reducing the amount of required training data, data utilization efficiency is improved while maintaining or enhancing model performance.
3. **Verification on multiple datasets**: The effectiveness and superiority of LLKD are verified through experiments on multiple datasets.

### Method overview:

- **Teacher model**: Use a powerful LLM (such as LLaMA) to generate pseudo-labels and calculate the confidence of each sample.
- **Student model**: Train a smaller pre-trained language model (such as RoBERTa) and calculate the uncertainty of each sample.
- **Data selection**: Select high-quality and informative samples for training according to the confidence of the teacher model and the uncertainty of the student model.

### Experimental results:

- **Classification performance**: LLKD significantly outperforms the baseline method on multiple datasets. In particular, on the PubMed-RCT-20k dataset, it achieves a relative F1 improvement of 6.25%.
- **Data efficiency**: In most cases, LLKD needs to select only about 20% of the training samples, and on some datasets (such as PubMed-RCT-20k) it achieves a significant performance improvement with only 3.7% of the samples.

In conclusion, by proposing an effective adaptive sample selection method, the paper addresses the challenges of using LLM-generated pseudo-labels to train smaller models and significantly improves model performance and data utilization efficiency.
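
The data selection idea in the method overview (keep samples where the teacher is confident and the student is uncertain) can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not specify how confidence and uncertainty are computed or how the adaptive thresholds are set, so the sketch assumes teacher confidence is the probability of the pseudo-label, student uncertainty is the Shannon entropy of the student's predicted distribution, and both thresholds are fixed, hypothetical values (`conf_threshold`, `unc_threshold`). The function names are also hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_samples(teacher_probs, student_probs,
                   conf_threshold=0.8, unc_threshold=0.5):
    """Return (index, pseudo_label) pairs for the samples to train on.

    Assumptions (not taken from the paper):
      - teacher confidence = probability the teacher assigns to its pseudo-label
      - student uncertainty = entropy of the student's predicted distribution
      - fixed thresholds stand in for the paper's adaptive selection
    """
    selected = []
    for i, (t, s) in enumerate(zip(teacher_probs, student_probs)):
        # The pseudo-label is the teacher's most probable class.
        pseudo_label = max(range(len(t)), key=t.__getitem__)
        teacher_conf = t[pseudo_label]
        student_unc = entropy(s)
        # Keep reliable labels that the student still needs to learn.
        if teacher_conf >= conf_threshold and student_unc >= unc_threshold:
            selected.append((i, pseudo_label))
    return selected


# Toy class-probability vectors for three unlabeled samples.
teacher = [[0.95, 0.05], [0.55, 0.45], [0.90, 0.10]]
student = [[0.50, 0.50], [0.50, 0.50], [0.99, 0.01]]
print(select_samples(teacher, student))
```

In this toy input only the first sample is kept: the second has an unreliable pseudo-label (teacher confidence 0.55), and the third is already easy for the student (entropy near zero). An adaptive variant would update the thresholds during training rather than fixing them up front.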

Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data

MiniLLM: Knowledge Distillation of Large Language Models

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Evolving Knowledge Distillation with Large Language Models and Active Learning

A Survey on Knowledge Distillation of Large Language Models

Knowledge Distillation of LLM for Automatic Scoring of Science Education Assessments

Direct Preference Knowledge Distillation for Large Language Models

Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application

Knowledge Distillation Meets Label Noise Learning: Ambiguity-Guided Mutual Label Refinery

MiniPLM: Knowledge Distillation for Pre-Training Language Models

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

Knowledge Distillation of Black-Box Large Language Models

LLMR: Knowledge Distillation with a Large Language Model-Induced Reward

Learning from a Lightweight Teacher for Efficient Knowledge Distillation

Pre-training Distillation for Large Language Models: A Design Space Exploration

Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data

Using Advanced LLMs to Enhance Smaller LLMs: An Interpretable Knowledge Distillation Approach

BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation

Dynamic Knowledge Distillation for Pre-trained Language Models

PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models