Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures

Kuluhan Binici, Weiming Wu, Tulika Mitra

2024-07-23

Abstract: Knowledge distillation (KD) is a model compression method that entails training a compact student model to emulate the performance of a more complex teacher model. However, the architectural capacity gap between the two models limits the effectiveness of knowledge transfer. Addressing this issue, previous works focused on customizing teacher-student pairs to improve compatibility, a computationally expensive process that needs to be repeated every time either model changes. Hence, these methods are impractical when a teacher model has to be compressed into different student models for deployment on multiple hardware devices with distinct resource constraints. In this work, we propose Generic Teacher Network (GTN), a one-off KD-aware training to create a generic teacher capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures. To this end, we represent the student pool as a weight-sharing supernet and condition our generic teacher to align with the capacities of various student architectures sampled from this supernet. Experimental evaluation shows that our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across students in the pool.
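As a rough illustration of the "student emulates teacher" objective the abstract describes, the sketch below computes the classic temperature-softened distillation loss: the KL divergence between the teacher's and the student's output distributions. This is the standard formulation from the general KD literature, not necessarily the loss used by GTN; the function names `log_softmax` and `kd_loss` and the default temperature are assumptions made for illustration.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the conventional rescaling that keeps gradient
    magnitudes comparable across temperatures. The default temperature
    is an illustrative assumption, not a value taken from the paper.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [0.5, 1.5, 0.2]
    print(kd_loss(student, teacher))  # positive: the student disagrees with the teacher
    print(kd_loss(teacher, teacher))  # zero: identical outputs
```

A larger capacity gap typically makes this loss harder for the student to drive down, which is the compatibility problem the abstract refers to.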

Machine Learning, Artificial Intelligence

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the architectural capacity gap between teacher and student models in Knowledge Distillation (KD). Specifically:

1. **Capacity Gap Issue**: In traditional knowledge distillation methods, a significant difference in architectural capacity between the teacher and the student limits the effectiveness of knowledge transfer. As a result, some neural networks benefit little from knowledge distillation.

2. **Limitations of Customized Teacher Models**: Previous works have tried to improve knowledge transfer to a specific student model by customizing the teacher. However, this approach requires retraining for each student model, which is time-consuming, especially when models must be deployed on a variety of hardware devices.

3. **Multi-Platform Deployment Requirements**: In practice, different hardware devices have different resource constraints and therefore need neural network models of varying size and complexity. Choosing an appropriate model is difficult because high-performance models often consume significant resources, while resource-efficient models often fall short on performance.

To address these issues, the paper proposes the Generic Teacher Network (GTN), a one-off training method that creates a generic teacher model capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures. Experimental results show that this method both improves the overall effectiveness of knowledge distillation and amortizes the additional cost of training the generic teacher across the student models in the pool. Furthermore, the method performs well in Neural Architecture Search (NAS) scenarios, improving the performance of various customized student models.
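To make the idea of a finite student pool more concrete, here is a minimal sketch of how such a pool might be described as a search space and sampled from. The search space, layer names, and width options below are entirely hypothetical and are not taken from the paper; the weight-slicing helper only illustrates the general notion of weight sharing, in which a narrower candidate reuses part of a wider layer's parameters.

```python
import math
import random

# Hypothetical search space: each layer offers a few candidate widths.
SEARCH_SPACE = {
    "layer1": [16, 32, 64],
    "layer2": [32, 64, 128],
    "layer3": [64, 128, 256],
}


def pool_size(space):
    """Number of distinct student architectures in the finite pool."""
    return math.prod(len(options) for options in space.values())


def sample_student(space, rng):
    """Draw one student architecture by picking a width for every layer."""
    return {layer: rng.choice(options) for layer, options in space.items()}


def slice_shared_weights(shared_weights, width):
    """Weight sharing: a narrower candidate reuses the first `width` entries."""
    return shared_weights[:width]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampling is reproducible
    print("pool size:", pool_size(SEARCH_SPACE))
    print("sampled student:", sample_student(SEARCH_SPACE, rng))
```

In a one-off training scheme like the one summarized above, students sampled this way would presumably be used during teacher training to condition the teacher; how exactly GTN does this is not specified in this summary.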

Adaptive Cross-Architecture Mutual Knowledge Distillation

Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture

Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation

An Embarrassingly Simple Approach for Knowledge Distillation

Homogeneous teacher based buffer knowledge distillation for tiny neural networks

Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Multiple-Stage Knowledge Distillation

Improving Knowledge Distillation With a Customized Teacher

Comparative Knowledge Distillation

TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant

Collaborative Knowledge Distillation Via Multiknowledge Transfer

Relay Knowledge Distillation for Efficiently Boosting the Performance of Shallow Networks

Knowledge Distillation via Token-Level Relationship Graph Based on the Big Data Technologies

Frameless Graph Knowledge Distillation

Teacher-Student Architecture for Knowledge Distillation: A Survey

BD-KD: Balancing the Divergences for Online Knowledge Distillation

Fine-Grained Learning Behavior-Oriented Knowledge Distillation for Graph Neural Networks

Collaborative Knowledge Distillation

Learning from a Lightweight Teacher for Efficient Knowledge Distillation

Annealing Knowledge Distillation