Obtain Dark Knowledge via Extended Knowledge Distillation.

Chen Yuan, Rong Pan
DOI: https://doi.org/10.1109/AIAM48774.2019.00106
2019-01-01
Abstract: Training a smaller student model for portable devices such as smartphones to mimic a complex, heavy teacher model through knowledge distillation has received extensive attention. Many studies have examined the application of knowledge distillation to various tasks, as well as ways to improve the supervision signal used when training the student model. However, few studies focus on distilling the "dark knowledge" that resides in the teacher model but is hard to express directly. This knowledge is important because the data used to train the teacher model are not always visible to the student model. In this paper we extend the method of knowledge distillation: in addition to using the difference between the logits of the teacher model and the student model as one part of the loss function, as basic knowledge distillation does, we also pay attention to the interior of both models. We divide both the teacher model and the student model into several segments and make the outputs of corresponding segments as close as possible, which forms a second part of the loss function. We refer to this method as "Extended-KD" (Extended Knowledge Distillation). In our experiments, we first train the student model on the complete CIFAR-10 dataset as a baseline, and then drop all examples of some labels and train the student model through Extended-KD. Our experiments show that Extended-KD performs better than basic knowledge distillation, and that knowledge distillation with an incomplete dataset can still enable the student model to predict target labels it has never seen. We therefore conclude that Extended-KD can effectively obtain dark knowledge.
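The two-part loss described in the abstract can be illustrated with a minimal sketch for a single example. The abstract does not give the exact formulas, so everything specific here is an assumption: the logit term is written as a temperature-softened KL divergence (the common choice in basic knowledge distillation), the segment term as a mean squared error between segment outputs of equal length, and the names `temperature`, `segment_weight`, and `drop_labels` are hypothetical.

```python
import math


def log_softmax(logits, temperature):
    """Log-probabilities of temperature-softened logits (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def logit_loss(student_logits, teacher_logits, temperature=4.0):
    """Assumed logit term: KL(teacher || student) on softened distributions."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))


def segment_loss(student_segments, teacher_segments):
    """Assumed segment term: sum over segments of the mean squared difference.

    Each argument is a list of flat output vectors, one per segment; matching
    segments are assumed to have the same length.
    """
    total = 0.0
    for s_out, t_out in zip(student_segments, teacher_segments):
        total += sum((s - t) ** 2 for s, t in zip(s_out, t_out)) / len(t_out)
    return total


def extended_kd_loss(student_logits, teacher_logits,
                     student_segments, teacher_segments,
                     temperature=4.0, segment_weight=1.0):
    """Logit term plus a weighted segment term (the weighting is an assumption)."""
    return (logit_loss(student_logits, teacher_logits, temperature)
            + segment_weight * segment_loss(student_segments, teacher_segments))


def drop_labels(examples, dropped):
    """Remove every (input, label) pair whose label is in `dropped`."""
    return [(x, y) for x, y in examples if y not in dropped]
```

In this sketch the loss is zero when the student reproduces both the teacher's logits and its segment outputs, and `drop_labels` mimics the incomplete-dataset setting in which the student never sees examples of some classes.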