Homogeneous teacher based buffer knowledge distillation for tiny neural networks

Xinru Dai, Gang Lu, Jianhua Shen, Shuo Huang, Tongquan Wei
DOI: https://doi.org/10.1016/j.sysarc.2024.103078
IF: 5.836
2024-01-30
Journal of Systems Architecture
Abstract: Knowledge Distillation (KD) has shown great promise in improving the performance of tiny neural networks. Most existing KD methods suffer from a large teacher-student discrepancy, so students struggle to learn useful knowledge and may not achieve effective distillation. In this paper, we focus on the construction and training of homogeneous teachers and propose a novel Buffer Knowledge Distillation (BKD), which reduces the teacher-student discrepancy in terms of both network architecture and distilled knowledge. Specifically, we first construct a series of homogeneous networks with larger capacity based on the student. A width-by-width fine-tuning mechanism is developed to reduce the training cost of these homogeneous networks, and the one with the highest accuracy is selected as the teacher. Furthermore, we propose BKD to reduce the learning difficulty: teacher and student features are fused into buffer features by our new multi-scale feature fusion module. Extensive image classification experiments have been conducted to verify the homogeneous teacher based BKD, which consistently outperforms many existing KD methods. The results show that our method achieves up to 4.75% accuracy improvement on CIFAR-100, and that the width-by-width fine-tuning mechanism reduces training time by 33.58% on CIFAR-100 and 36.08% on CIFAR-10.
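The teacher-construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the wider networks are parameterized, so the per-layer channel list, the width multipliers, and the `evaluate` callback (a stand-in for training a candidate and measuring its validation accuracy) are all assumptions made for illustration. The sketch only shows the idea of widening the student's own architecture and keeping the most accurate candidate as the teacher.

```python
from typing import Callable, Dict, List, Sequence


def widen(student_channels: Sequence[int], multiplier: float) -> List[int]:
    """Scale every layer's channel count by one factor.

    Depth and layer layout stay identical to the student, which is what
    makes the result a "homogeneous" network in this sketch (assumption).
    """
    return [max(1, round(c * multiplier)) for c in student_channels]


def build_candidates(
    student_channels: Sequence[int], multipliers: Sequence[float]
) -> Dict[float, List[int]]:
    """Build one wider candidate per (hypothetical) width multiplier."""
    return {m: widen(student_channels, m) for m in multipliers}


def select_teacher(
    candidates: Dict[float, List[int]],
    evaluate: Callable[[List[int]], float],
) -> float:
    """Return the multiplier of the candidate with the highest accuracy.

    `evaluate` is a placeholder for training and validating a candidate.
    """
    return max(candidates, key=lambda m: evaluate(candidates[m]))
```

In the paper's width-by-width fine-tuning, each wider candidate would presumably be fine-tuned starting from an already trained narrower one rather than trained from scratch; that step is left out here because the abstract gives no detail on how weights are transferred.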
computer science, software engineering, hardware & architecture