Adaptive Teaching with Shared Classifier for Knowledge Distillation

Jaeyeon Jang, Young-Ik Kim, Jisu Lim, Hyeonseong Lee
2024-06-14
Abstract: Knowledge distillation (KD) is a technique used to transfer knowledge from an overparameterized teacher network to a less-parameterized student network, thereby minimizing the incurred performance loss. KD methods can be categorized into offline and online approaches. Offline KD leverages a powerful pretrained teacher network, while online KD allows the teacher network to be adjusted dynamically to enhance the learning effectiveness of the student network. Recently, it has been discovered that sharing the classifier of the teacher network can significantly boost the performance of the student network with only a minimal increase in the number of network parameters. Building on these insights, we propose adaptive teaching with a shared classifier (ATSC). In ATSC, the pretrained teacher network self-adjusts to better align with the learning needs of the student network based on its capabilities, and the student network benefits from the shared classifier, enhancing its performance. Additionally, we extend ATSC to environments with multiple teachers. We conduct extensive experiments, demonstrating the effectiveness of the proposed KD method. Our approach achieves state-of-the-art results on the CIFAR-100 and ImageNet datasets in both single-teacher and multiteacher scenarios, with only a modest increase in the number of required model parameters. The source code is publicly available at https://github.com/random2314235/ATSC.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to transfer knowledge more effectively from a teacher network with many parameters to a student network with fewer parameters during knowledge distillation (KD), while minimizing the performance loss. It proposes a method named Adaptive Teaching with Shared Classifier (ATSC), which improves on existing techniques in three ways:

1. **Pre-trained large-scale teacher networks**: Large pre-trained teacher networks that already have strong discrimination capabilities serve as the knowledge source.
2. **Adaptive teaching**: The teacher network dynamically adjusts its own parameters according to the learning needs of the student network, so that it better supports the student's training.
3. **Shared classifier**: The student network directly uses the classifier of the teacher network. This adds only a small number of parameters and significantly enhances the student's discrimination ability.

### Main contributions of the paper

- **Effectively integrating three key components of knowledge distillation for the first time**: powerful pre-trained teacher networks, KD based on adaptive teaching, and shared classifiers.
- **Introducing the concept of adaptive teaching**: The teacher model sacrifices part of its own discrimination ability to help the student model learn representations more effectively. Experimental results show that this slight decrease in the teacher's discrimination ability can lead to a significant improvement in the performance of the student model.
- **Achieving state-of-the-art performance on the CIFAR-100 and ImageNet datasets**:
  - On CIFAR-100, ATSC improves the accuracy of the baseline student network (trained without KD) by 5.30% in the single-teacher setting and by 6.70% in the multi-teacher setting.
  - On the more challenging ImageNet dataset, ATSC improves the accuracy of the student model (ResNet-18) by 1.19% while achieving the fastest training convergence.
- **Demonstrating the robustness of ATSC under different balance parameter settings**, which reduces the effort required for hyper-parameter optimization.

### Experimental verification

The paper verifies the effectiveness of ATSC through extensive experiments, including comparisons with a variety of state-of-the-art offline and online KD methods. The results show that ATSC performs well across settings, especially in multi-teacher scenarios.

### Method overview

- **Background: knowledge distillation reusing the teacher classifier**: A projector aligns the feature dimension of the student network with that of the teacher network so that the student can reuse the teacher's classifier, which reduces performance loss.
- **Adaptive teaching and shared classifier**: Optimization proceeds in two steps. The encoders of the teacher and student networks are optimized first, and the shared classifier is then updated so that it retains as much of its discrimination ability as possible.
- **Extension to multi-teacher models**: In the multi-teacher scenario, the student network learns from the average of the adjusted representations of multiple teachers and makes predictions through the optimized projector and the shared classifier.

### Conclusion

By combining pre-trained large-scale teacher networks, adaptive teaching, and a shared classifier, ATSC significantly improves the performance of the student network while keeping the increase in parameters small. The improvements hold on the small-scale dataset (CIFAR-100) and on the large-scale dataset (ImageNet).
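The data flow described in the method overview (student encoder, projector, shared classifier, and averaging over several teachers) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the single linear layers standing in for the encoders, the dimensions, the helper names (`linear`, `random_linear`, `mean_representation`, `student_predict`, `feature_loss`), and the mean-squared-error feature loss are all assumptions made for illustration, and the two-step optimization itself is omitted.

```python
import random

def linear(weights, bias, x):
    """Apply an affine map; weights is a list of rows, x is a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def random_linear(out_dim, in_dim, rng):
    """Hypothetical stand-in for a trained layer: small random weights, zero bias."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def mean_representation(reps):
    """Element-wise average of several teacher feature vectors."""
    return [sum(vals) / len(reps) for vals in zip(*reps)]

def student_predict(x, student_encoder, projector, shared_classifier):
    """Student encoder -> projector (to teacher feature dim) -> shared classifier."""
    features = linear(*student_encoder, x)
    projected = linear(*projector, features)
    logits = linear(*shared_classifier, projected)
    return projected, logits

def feature_loss(projected, target):
    """Mean squared error between projected student features and the teacher
    target. The choice of MSE is an assumption, not taken from the paper."""
    return sum((p - t) ** 2 for p, t in zip(projected, target)) / len(target)

rng = random.Random(0)
IN_DIM, STUDENT_DIM, TEACHER_DIM, NUM_CLASSES = 8, 4, 6, 3  # illustrative sizes

# Real encoders are deep networks; one linear layer each keeps the sketch short.
student_encoder = random_linear(STUDENT_DIM, IN_DIM, rng)
projector = random_linear(TEACHER_DIM, STUDENT_DIM, rng)
shared_classifier = random_linear(NUM_CLASSES, TEACHER_DIM, rng)
teacher_encoders = [random_linear(TEACHER_DIM, IN_DIM, rng) for _ in range(2)]

x = [rng.uniform(-1, 1) for _ in range(IN_DIM)]

# Multi-teacher target: the average of the teachers' representations.
target = mean_representation([linear(*t, x) for t in teacher_encoders])

# The student predicts through the projector and the shared classifier.
projected, logits = student_predict(x, student_encoder, projector, shared_classifier)
loss = feature_loss(projected, target)
```

In this sketch the only parameters the student gains beyond its own encoder are those of the projector and the reused classifier, which is consistent with the small parameter increase the summary reports. With a single teacher, `teacher_encoders` would hold one entry and the average reduces to that teacher's representation.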