Using Less but Important Information for Feature Distillation

Xiang Wen, Yanming Chen, Li Liu, Choonghyun Lee, Yi Zhao, Yong Gong
DOI: https://doi.org/10.1007/978-981-99-8079-6_31
2024-01-01
Abstract: The purpose of feature distillation is to use the teacher network to supervise the student network so that the student can mimic the teacher's intermediate-layer representations. The most intuitive approach is to use the mean squared error (MSE) to minimize the distance between the feature representations at the same level of both networks. However, one problem in feature distillation is that the dimensions of the student network's intermediate-layer feature maps may differ from those of the teacher network. Previous work has mostly designed a projector to transform the feature maps to the same dimension. In this paper, we propose a simple and straightforward feature distillation method that handles the feature dimension inconsistency between the teacher and student networks without an additional projector. We consider the redundancy of the data and show that it is not necessary to use all of the information when performing feature distillation. Specifically, we propose a cut-off operation for channel alignment and use singular value decomposition (SVD) for knowledge alignment, so that only the important information is transferred to the student network, which resolves the dimension inconsistency problem. Extensive experiments on several different models show that our method can improve the performance of student networks.
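The abstract does not spell out how the cut-off and SVD steps are applied, so the following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation. It assumes that "cut-off" means keeping the first C_s teacher channels (C_s being the student's channel count), that "SVD" means a low-rank truncation of the channel-by-spatial feature matrix, and that both feature maps share the same spatial size. The function names and the `rank` parameter are hypothetical.

```python
import numpy as np

def cut_off_channels(teacher_feat, num_channels):
    """Assumed cut-off: keep the first `num_channels` channels of a (C, H, W) map."""
    return teacher_feat[:num_channels]

def low_rank_approx(feat, rank):
    """Assumed SVD step: keep the top-`rank` singular components of the
    (C, H*W) matrix obtained by flattening the spatial dimensions."""
    c, h, w = feat.shape
    mat = feat.reshape(c, h * w)
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return approx.reshape(c, h, w)

def distill_loss(student_feat, teacher_feat, rank):
    """MSE between the student features and the reduced teacher target."""
    target = cut_off_channels(teacher_feat, student_feat.shape[0])
    target = low_rank_approx(target, rank)
    return float(np.mean((student_feat - target) ** 2))

# Example: a teacher with 8 channels supervising a student with 4 channels.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 4, 4))
student = rng.standard_normal((4, 4, 4))
loss = distill_loss(student, teacher, rank=2)
```

In this sketch no learned projector is involved: the teacher target is reduced to the student's shape directly, which is the property the abstract emphasizes. Which channels to keep and how many singular values to retain are choices the abstract leaves open.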