Comparative Knowledge Distillation

Alex Wilf, Alex Tianyi Xu, Paul Pu Liang, Alexander Obolenskiy, Daniel Fried, Louis-Philippe Morency

2023-11-04

Abstract: In the era of large scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference -- a notion increasingly at odds with the realities of costly, often proprietary, large scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state of the art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. Critically, CKD provides additional learning signals to the student without making additional teacher calls. We also extend the principle of CKD to groups of samples, enabling even more efficient learning from limited teacher calls. Empirical evaluation across varied experimental settings indicates that CKD consistently outperforms state of the art data augmentation and KD techniques.

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to minimize the number of calls made to the teacher model during Knowledge Distillation (KD). Specifically:

1. **Background and Motivation**:
   - With the rise of large-scale pretrained models, knowledge distillation has become an important way to transfer the knowledge of large models to smaller, more efficient ones.
   - Traditional knowledge distillation methods, however, usually assume frequent access to the teacher model for inference. This assumption is increasingly unrealistic in practice because large models are costly to run and often proprietary.
2. **Problem**:
   - How can knowledge distillation be performed effectively when only a limited number of teacher model calls are available? The paper calls this setting Few Teacher Inference Knowledge Distillation (FTI KD).
   - Existing data augmentation techniques and knowledge distillation strategies perform poorly in this constrained setting.
3. **Solution**:
   - The paper proposes Comparative Knowledge Distillation (CKD), which improves performance by encouraging the student model to understand the subtle differences in how the teacher model interprets different samples.
   - The core idea is inspired by educational principles that emphasize learning through comparison: the student receives additional learning signal from comparisons between the teacher's representations of different samples, without any additional calls to the teacher model.

Experiments show that CKD outperforms existing data augmentation and knowledge distillation techniques across a range of experimental settings, improving accuracy by more than 7% under certain resource-constrained conditions. CKD can also be combined with intermediate layer loss functions to further improve performance.
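The comparison idea in the summary above can be illustrated with a minimal sketch. This is one plausible reading, not the authors' implementation: it assumes the teacher's outputs for a small set of samples have already been computed and cached, and that the student is penalized when the difference between its representations of two samples departs from the difference between the teacher's representations of the same two samples. The function names `comparative_loss` and `num_pairs`, the squared-error form of the loss, and the use of NumPy are all assumptions made for illustration.

```python
import numpy as np


def num_pairs(n: int) -> int:
    """Number of unordered sample pairs available from n cached teacher outputs."""
    return n * (n - 1) // 2


def comparative_loss(teacher_reps, student_reps) -> float:
    """Hypothetical comparative loss (illustrative assumption, not the paper's code).

    For every unordered pair (i, j), compare the teacher's difference
    t_i - t_j with the student's difference s_i - s_j, and average the
    squared Euclidean gap over all pairs.
    """
    t = np.asarray(teacher_reps, dtype=float)  # shape (n, d)
    s = np.asarray(student_reps, dtype=float)  # shape (n, d)

    # Broadcasting builds all pairwise differences, shape (n, n, d).
    t_diff = t[:, None, :] - t[None, :, :]
    s_diff = s[:, None, :] - s[None, :, :]

    # Keep each unordered pair once (strict upper triangle).
    i, j = np.triu_indices(t.shape[0], k=1)
    gap = t_diff[i, j] - s_diff[i, j]          # shape (num_pairs, d)
    return float(np.mean(np.sum(gap ** 2, axis=-1)))


if __name__ == "__main__":
    # Three cached teacher representations: no further teacher calls needed.
    teacher = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    student = np.zeros_like(teacher)
    print("pairs from 3 teacher calls:", num_pairs(3))
    print("loss for an untrained student:", comparative_loss(teacher, student))
```

In this sketch, n cached teacher outputs give n(n-1)/2 comparison targets, which is the sense in which extra learning signal is obtained without extra teacher calls. Note that a loss of this form only constrains differences between samples, so it would normally be used alongside a standard distillation or task loss rather than on its own.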


An Embarrassingly Simple Approach for Knowledge Distillation

QEKD: Query-Efficient and Data-Free Knowledge Distillation from Black-box Models

Knowledge Condensation Distillation

Collaborative Knowledge Distillation Via Multiknowledge Transfer

Stage-by-stage Knowledge Distillation

Why does Knowledge Distillation Work? Rethink its Attention and Fidelity Mechanism

Role-Wise Data Augmentation for Knowledge Distillation

Reciprocal Teacher-Student Learning Via Forward and Feedback Knowledge Distillation

Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching

Knowledge Augmentation for Distillation: A General and Effective Approach to Enhance Knowledge Distillation

Improving Knowledge Distillation With a Customized Teacher

Deep Collective Knowledge Distillation

Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation

Learning from a Lightweight Teacher for Efficient Knowledge Distillation

Improving Knowledge Distillation with Teacher's Explanation

Revisiting Knowledge Distillation Via Label Smoothing Regularization

Collaborative Knowledge Distillation

Learning to Teach with Student Feedback

Attention and feature transfer based knowledge distillation