Bridging the Gap of Dimensions in Distillation: Understanding the knowledge transfer between different-dimensional semantic spaces

Zhiyuan Ma, Ziyue Song, Haodong Zhao, Kui Meng, Gongshen Liu
DOI: https://doi.org/10.1109/IJCNN52387.2021.9534452
2021-01-01
Abstract: In recent years, knowledge distillation has been widely used in deep learning to reduce model size and save time and space. The student-teacher paradigm is a framework for knowledge distillation, which was proposed to minimize the KL divergence between the probabilistic outputs of a teacher network and a student network. However, apart from the probabilistic outputs, much valuable information is contained in the middle layers of the teacher network. In NLP tasks, the hidden vectors from different layers of a model carry different semantic information, but in many cases the dimension of the student network's vectors differs from that of the teacher network, which makes it hard to perform hidden-layer distillation directly. We propose to simply use a transition matrix to project the student's vector into a space of the same dimension as the teacher's vector, and we theoretically prove the effectiveness of this method. Our analysis shows how the transition matrix preserves important semantic information, which is closely related to the vector's characteristics in Euclidean space. We provide a geometric method for interpreting the shared knowledge space of student-teacher architectures. Our experiments show that this method can significantly improve the performance of a small model across different tasks and different models.
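The projection idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the exact loss, so the squared-error hidden-layer loss, the KL direction (teacher relative to student), the temperature, the weighting factor `alpha`, and all function and variable names below are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Mean KL(p || q) over the batch; p is the teacher, q the student (assumed direction)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def hidden_projection_loss(h_student, h_teacher, W):
    """Project student vectors with the transition matrix W, then compare
    them with the teacher vectors using squared Euclidean distance (assumed loss)."""
    projected = h_student @ W  # shape: (batch, d_teacher)
    return float(np.mean(np.sum((projected - h_teacher) ** 2, axis=-1)))

# Toy data: the student hidden size (8) is smaller than the teacher's (16).
rng = np.random.default_rng(0)
batch, d_student, d_teacher, n_classes = 4, 8, 16, 3
h_s = rng.normal(size=(batch, d_student))
h_t = rng.normal(size=(batch, d_teacher))
W = rng.normal(size=(d_student, d_teacher)) * 0.1  # transition matrix; learned in practice

logits_s = rng.normal(size=(batch, n_classes))
logits_t = rng.normal(size=(batch, n_classes))
p_t = softmax(logits_t, temperature=2.0)
p_s = softmax(logits_s, temperature=2.0)

alpha = 0.5  # hypothetical weight between output-level and hidden-layer terms
output_loss = kl_divergence(p_t, p_s)
hidden_loss = hidden_projection_loss(h_s, h_t, W)
total_loss = alpha * output_loss + (1.0 - alpha) * hidden_loss
print(total_loss)
```

In a real training loop, `W` would be a trainable parameter optimized together with the student, and the hidden-layer term could be applied to several layer pairs; the sketch only shows how a single `(d_student, d_teacher)` matrix makes the two vector spaces comparable.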