Improved Knowledge Distillation via Full Kernel Matrix Transfer

Qi Qian, Hao Li, Juhua Hu
DOI: https://doi.org/10.48550/arXiv.2009.14416
2022-03-30
Abstract: Knowledge distillation is an effective way for model compression in deep learning. Given a large model (i.e., teacher model), it aims to improve the performance of a compact model (i.e., student model) by transferring the information from the teacher. Various information for distillation has been studied. Recently, a number of works propose to transfer the pairwise similarity between examples to distill relative information. However, most of efforts are devoted to developing different similarity measurements, while only a small matrix consisting of examples within a mini-batch is transferred at each iteration that can be inefficient for optimizing the pairwise similarity over the whole data set. In this work, we aim to transfer the full similarity matrix effectively. The main challenge is from the size of the full matrix that is quadratic to the number of examples. To address the challenge, we decompose the original full matrix with Nyström method. By selecting appropriate landmark points, our theoretical analysis indicates that the loss for transfer can be further simplified. Concretely, we find that the difference between the original full kernel matrices between teacher and student can be well bounded by that of the corresponding partial matrices, which only consists of similarities between original examples and landmark points. Compared with the full matrix, the size of the partial matrix is linear in the number of examples, which improves the efficiency of optimization significantly. The empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code is available at https://github.com/idstcv/KDA.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to efficiently transfer the teacher model's full pairwise similarity (kernel) matrix to the student model during knowledge distillation. Existing similarity-based distillation methods transfer only the small similarity matrix computed within each mini-batch, which is inefficient for optimizing pairwise similarity over the whole data set. The paper instead decomposes the full similarity matrix with the Nyström method, a low-rank approximation built on a set of landmark points, so that the full matrix can be transferred at a cost linear in the number of examples. Specifically, the paper addresses the following challenges:
1. **Efficiency of full-matrix transfer**: The size of the full similarity matrix grows quadratically with the number of examples, so transferring it directly is impractical. The Nyström method replaces it with a low-rank approximation, which greatly reduces the amount of similarity information that has to be matched.
2. **Selection of landmark points**: The paper proposes using class centers as landmark points, which improves the accuracy of the approximation while keeping optimization efficient.
3. **Theoretical guarantee**: The analysis shows that the difference between the teacher's and student's full kernel matrices is bounded by the difference between the corresponding partial matrices, which contain only the similarities between the original examples and the landmark points. Minimizing the partial-matrix difference therefore controls the full-matrix difference.

Because the partial matrix is linear in the number of examples, optimization becomes much more efficient, and the empirical study on benchmark data sets reports that the method is effective at improving the student model.
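
The idea of matching a partial matrix instead of the full one can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation (see the linked repository for that). It assumes a linear kernel, class-mean landmarks computed separately in the teacher and student feature spaces, and a plain mean squared difference as the transfer loss. The function names `linear_kernel`, `class_center_landmarks`, `nystrom_approx`, and `partial_matrix_loss` are hypothetical and do not come from the paper.

```python
import numpy as np


def linear_kernel(a, b):
    """Similarity between every row of `a` and every row of `b` (assumed kernel)."""
    return a @ b.T


def class_center_landmarks(features, labels):
    """One landmark per class: the mean feature vector of that class."""
    return np.stack([features[labels == c].mean(axis=0) for c in np.unique(labels)])


def nystrom_approx(features, landmarks, rcond=1e-10):
    """Nystrom approximation C W^+ C^T of the full n x n kernel matrix."""
    C = linear_kernel(features, landmarks)   # n x m partial matrix
    W = linear_kernel(landmarks, landmarks)  # m x m landmark matrix
    return C @ np.linalg.pinv(W, rcond=rcond) @ C.T


def partial_matrix_loss(teacher_feats, student_feats, labels):
    """Mean squared difference between teacher and student partial matrices.

    Assumption: each model uses its own class centers as landmarks, so both
    partial matrices are n x m even if the feature dimensions differ.
    """
    C_t = linear_kernel(teacher_feats, class_center_landmarks(teacher_feats, labels))
    C_s = linear_kernel(student_feats, class_center_landmarks(student_feats, labels))
    return float(np.mean((C_t - C_s) ** 2))


# Toy data: 60 examples, 5 classes, teacher and student with different widths.
rng = np.random.default_rng(0)
n, m = 60, 5
labels = np.arange(n) % m
teacher = rng.normal(size=(n, 8))
student = rng.normal(size=(n, 4))

print("partial-matrix loss:", partial_matrix_loss(teacher, student, labels))
print("entries compared:", n * m, "instead of", n * n)
```

In this toy setting the loss compares 300 similarities (60 examples by 5 landmarks) rather than the 3600 entries of the full matrix, which is the linear-versus-quadratic saving the summary describes.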