Comparative Knowledge Distillation

Alex Wilf, Alex Tianyi Xu, Paul Pu Liang, Alexander Obolenskiy, Daniel Fried, Louis-Philippe Morency
2023-11-04
Abstract: In the era of large-scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference, a notion increasingly at odds with the realities of costly, often proprietary, large-scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state-of-the-art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. Critically, CKD provides additional learning signals to the student without making additional teacher calls. We also extend the principle of CKD to groups of samples, enabling even more efficient learning from limited teacher calls. Empirical evaluation across varied experimental settings indicates that CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to minimize the number of calls to the teacher model during Knowledge Distillation (KD). Specifically:

1. **Background and Motivation**:
   - With the development of large-scale pre-trained models, knowledge distillation has become an important way to transfer the knowledge of large models to smaller, more efficient models.
   - However, traditional knowledge distillation methods usually assume frequent access to the teacher model for inference, which is increasingly unrealistic in practice because large models are costly and often proprietary.
2. **Proposed Problem**:
   - How can knowledge distillation be performed effectively when only a limited number of teacher model calls are available (a setting referred to as Few-Teacher-Inference Knowledge Distillation, FTI-KD)?
   - Existing data augmentation techniques and knowledge distillation strategies perform poorly in this constrained setting.
3. **Solution**:
   - The paper proposes Comparative Knowledge Distillation (CKD), which improves performance by encouraging the student model to understand the subtle differences in how the teacher model interprets different samples.
   - The core idea of CKD is inspired by educational principles that emphasize learning through comparison: the student receives additional learning signal from comparisons between the teacher model's representations of different samples, without additional calls to the teacher model.

Experiments demonstrate that CKD significantly outperforms existing data augmentation and knowledge distillation techniques in various experimental settings, improving accuracy by more than 7% under certain resource-constrained conditions. Additionally, CKD can be combined with intermediate layer loss functions to further enhance performance.
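
To make the comparison idea concrete, here is a minimal sketch of one plausible form of a comparative loss. It is an illustration under stated assumptions, not the authors' implementation. This summary does not give the exact loss, so the choice of squared error, the use of all unordered sample pairs, and the names `comparative_loss` and `num_pairs` are hypothetical. The point it shows is that teacher outputs are computed once and cached, after which every pair of cached samples provides a training signal at no extra teacher cost.

```python
from itertools import combinations


def num_pairs(n_teacher_calls):
    """Number of unordered sample pairs available from n cached teacher outputs."""
    return n_teacher_calls * (n_teacher_calls - 1) // 2


def comparative_loss(student_reps, teacher_reps):
    """Mean squared error between pairwise differences of representations.

    ASSUMPTION: for each pair (i, j), the student is penalized by
    ||(s_i - s_j) - (t_i - t_j)||^2. The paper's actual loss may differ.
    Both arguments are lists of equal-length lists of floats; teacher_reps
    is assumed to be cached, so no new teacher inference happens here.
    """
    if len(student_reps) != len(teacher_reps):
        raise ValueError("student and teacher must cover the same samples")
    if len(teacher_reps) < 2:
        raise ValueError("need at least two samples to form a comparison")

    total = 0.0
    pairs = 0
    for i, j in combinations(range(len(teacher_reps)), 2):
        # Compare the student's difference vector to the teacher's, per dimension.
        for s_i, s_j, t_i, t_j in zip(
            student_reps[i], student_reps[j], teacher_reps[i], teacher_reps[j]
        ):
            total += ((s_i - s_j) - (t_i - t_j)) ** 2
        pairs += 1
    return total / pairs


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # cached teacher outputs
    student = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # untrained student
    print(comparative_loss(student, teacher))
    print(num_pairs(8))
```

In this sketch, a student whose representations differ from the teacher's only by a constant offset incurs zero loss, because only differences between samples are compared. A practical setup would therefore likely pair such a term with a standard task or distillation loss; that pairing is also an assumption here.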