Generalizing Teacher Networks for Effective Knowledge Distillation Across Student Architectures

Kuluhan Binici, Weiming Wu, Tulika Mitra
2024-07-23
Abstract: Knowledge distillation (KD) is a model compression method that entails training a compact student model to emulate the performance of a more complex teacher model. However, the architectural capacity gap between the two models limits the effectiveness of knowledge transfer. Addressing this issue, previous works focused on customizing teacher-student pairs to improve compatibility, a computationally expensive process that needs to be repeated every time either model changes. Hence, these methods are impractical when a teacher model has to be compressed into different student models for deployment on multiple hardware devices with distinct resource constraints. In this work, we propose Generic Teacher Network (GTN), a one-off KD-aware training to create a generic teacher capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures. To this end, we represent the student pool as a weight-sharing supernet and condition our generic teacher to align with the capacities of various student architectures sampled from this supernet. Experimental evaluation shows that our method both improves overall KD effectiveness and amortizes the minimal additional training cost of the generic teacher across students in the pool.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the architectural capacity gap between teacher and student models in Knowledge Distillation (KD). Specifically:

1. **Capacity Gap Issue**:
   - In traditional knowledge distillation, a large difference in architectural capacity between the teacher and the student limits the effectiveness of knowledge transfer, so some student networks benefit little from distillation.
2. **Limitations of Customized Teacher Models**:
   - Previous works improve knowledge transfer for a specific student model by customizing the teacher-student pair. However, this customization is computationally expensive and must be repeated whenever either model changes, which becomes costly when one teacher has to be compressed into several students for deployment on different hardware devices.
3. **Multi-Platform Deployment Requirements**:
   - In practical applications, different hardware devices have different resource constraints and therefore need neural network models of varying sizes and complexities. Choosing the appropriate model is challenging because high-performance models often consume significant resources, while resource-efficient models often sacrifice performance.

To address these issues, the paper proposes the Generic Teacher Network (GTN), a one-off KD-aware training method that creates a generic teacher capable of effectively transferring knowledge to any student model sampled from a given finite pool of architectures. The student pool is represented as a weight-sharing supernet, and the generic teacher is conditioned to align with the capacities of the student architectures sampled from it. Experimental results show that this method improves the overall effectiveness of knowledge distillation and amortizes the additional training cost of the generic teacher across the students in the pool. The method also performs well in Neural Architecture Search (NAS) scenarios, improving the performance of various customized student models.
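
For readers unfamiliar with the basic mechanism, the following is a minimal sketch of the widely used temperature-softened KD objective that a student typically minimizes when learning from a teacher. It is an illustration only: the text above does not specify the loss used by GTN, and the function names (`softmax`, `kd_loss`) and the default values of `temperature` and `alpha` are assumptions chosen for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Generic KD loss for one example (illustrative, not the GTN objective).

    Combines:
      * KL(teacher || student) on temperature-softened distributions, and
      * cross-entropy between the student's prediction and the true label.
    `temperature` and `alpha` are hypothetical defaults for this sketch.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL divergence from the softened teacher to the softened student.
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

    # Standard cross-entropy on the hard label at temperature 1.
    ce = -math.log(softmax(student_logits)[label])

    # The T^2 factor keeps the soft-target term on a scale comparable to
    # the hard-label term as the temperature changes.
    return alpha * (temperature ** 2) * kl + (1.0 - alpha) * ce


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print(kd_loss(student, teacher, label=0))
```

When the student's logits match the teacher's exactly, the KL term vanishes and only the hard-label term remains. The capacity gap discussed above shows up in this picture as a soft-target term that a small student cannot drive down, which is the mismatch a capacity-aware teacher is meant to reduce.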