Faculty Distillation with Optimal Transport

Su Lu, Han-Jia Ye, De-chuan Zhan
DOI: https://doi.org/10.48550/arXiv.2204.11526
2022-01-01
Abstract: Knowledge distillation (KD) has shown its effectiveness in improving a student classifier given a suitable teacher. The outpouring of diverse and plentiful pre-trained models may provide abundant teacher resources for KD. However, these models are often trained on different tasks from the student, which requires the student to precisely select the most contributive teacher and enable KD across different label spaces. These restrictions disclose the insufficiency of standard KD and motivate us to study a new paradigm called faculty distillation. Given a group of teachers (faculty), a student needs to select the most relevant teacher and perform generalized knowledge reuse. To this end, we propose to link the teacher’s task and the student’s task by optimal transport. Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions by minimizing Sinkhorn distances. The transportation cost also acts as a measurement of teachers’ adaptability so that we can rank the teachers efficiently according to their relatedness. Experiments under various settings demonstrate the succinctness and versatility of our method.
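The idea of comparing output distributions over different label spaces through a Sinkhorn distance, and ranking teachers by the resulting transport cost, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sinkhorn_distance`, the hand-set cost matrices, the example distributions, and the regularization strength are all assumptions chosen for illustration, whereas the abstract only states that costs come from the semantic relationship between label spaces.

```python
import numpy as np

def sinkhorn_distance(p, q, cost, reg=0.2, n_iters=500):
    """Entropy-regularized transport cost between p (over teacher classes)
    and q (over student classes). Illustrative helper, not the paper's code."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cost = np.asarray(cost, dtype=float)
    K = np.exp(-cost / reg)              # Gibbs kernel from the cost matrix
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(n_iters):
        u = p / (K @ v)                  # rescale rows toward marginal p
        v = q / (K.T @ u)                # rescale columns toward marginal q
    plan = u[:, None] * K * v[None, :]   # approximate transport plan
    return float(np.sum(plan * cost)), plan

# Hypothetical setup: each teacher predicts over 3 classes, the student over 2.
p_teacher = [0.6, 0.3, 0.1]              # assumed teacher output distribution
q_student = [0.7, 0.3]                   # assumed student output distribution

# Assumed semantic costs between teacher labels (rows) and student labels (columns).
cost_related = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]
cost_unrelated = [[0.8, 0.9], [0.9, 0.8], [0.85, 0.85]]

teachers = {
    "related_teacher": cost_related,
    "unrelated_teacher": cost_unrelated,
}
scores = {
    name: sinkhorn_distance(p_teacher, q_student, c)[0]
    for name, c in teachers.items()
}
ranking = sorted(scores, key=scores.get)  # lowest transport cost first
print(ranking)
```

In this toy setting the teacher whose labels are semantically close to the student's incurs the lower transport cost and is ranked first, which mirrors the role the abstract assigns to the transportation cost as a measure of teacher relatedness.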