Knowledge Distillation with a Precise Teacher and Prediction with Abstention

Yi Xu, Jian Pu, Hui Zhao
DOI: https://doi.org/10.1109/icpr48806.2021.9412696
2021-01-01
Abstract: Knowledge distillation, which trains a smaller model (the student) under the supervision of a larger model (the teacher), has achieved remarkable results in supervised learning. However, existing knowledge distillation methods have two major problems. First, the teacher's supervision is sometimes misleading; second, the student's predictions are not accurate enough. To address the first problem, instead of learning from a combination of the teacher's output and the ground truth, we apply knowledge adjustment to correct the teacher's supervision using the ground truth. For the second problem, we use the selective classification framework to train the student model. In particular, the deep gambler loss is adopted to predict with reservation by explicitly introducing an $(m+1)$-th class. To evaluate the effectiveness of our method, we consider two settings of knowledge distillation: (1) distillation across different network structures (AlexNet, ResNet), and (2) distillation across networks with different depths (ResNet18, ResNet50). Experimental results on benchmark datasets (Fashion-MNIST, SVHN, CIFAR10, CIFAR100) show higher prediction accuracy and lower coverage error.
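The two components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify how the two pieces are combined during training. The sketch assumes that knowledge adjustment swaps the teacher's top probability with the probability at the true label whenever the teacher is wrong, and that the gambler loss takes the commonly cited form -log(p_y * o + p_{m+1}), where o is a payoff hyperparameter. The function names and the `payoff` parameter are hypothetical.

```python
import math


def adjust_teacher(probs, label):
    """Assumed knowledge adjustment: if the teacher's top class is not the
    true label, swap the two probabilities so the true label ranks first."""
    adjusted = list(probs)
    top = max(range(len(adjusted)), key=adjusted.__getitem__)
    if top != label:
        adjusted[top], adjusted[label] = adjusted[label], adjusted[top]
    return adjusted


def gambler_loss(student_probs, label, payoff):
    """Assumed gambler loss for one sample.

    student_probs has m + 1 entries: m class probabilities followed by the
    reservation (abstention) probability in the last position.
    payoff is a hypothetical hyperparameter, typically chosen in (1, m].
    """
    reservation = student_probs[-1]
    return -math.log(student_probs[label] * payoff + reservation)


# A teacher that wrongly prefers class 0 when the true label is 1.
teacher_probs = [0.7, 0.2, 0.1]
corrected = adjust_teacher(teacher_probs, label=1)

# A student over m = 2 classes plus one abstention output.
student_probs = [0.5, 0.2, 0.3]
loss = gambler_loss(student_probs, label=0, payoff=2.0)
```

The swap keeps the adjusted vector a valid probability distribution, since it only permutes entries. In the loss, a larger reservation probability lowers the penalty on uncertain samples, which is what lets the student abstain instead of committing to a wrong class.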