SAKD: Sparse attention knowledge distillation
Zhen Guo, Pengzhou Zhang, Peng Liang
DOI: https://doi.org/10.1016/j.imavis.2024.105020
IF: 3.86
2024-04-18
Image and Vision Computing
Abstract: Deep learning techniques have attracted significant interest due to their success in large-model scenarios. However, large models often require massive computational resources, which poses a challenge for end devices with limited storage. How to transfer knowledge from large models to small ones, and achieve similar results with limited resources, requires further research. Knowledge distillation, which uses a teacher-student setup to migrate the capabilities of a large model to a small one, has been widely used in model compression and knowledge transfer. In this paper, a novel knowledge distillation approach based on a sparse attention mechanism (SAKD) is proposed. SAKD computes attention using student features as queries and teacher features as keys and values, and sparsifies the attention values by random deactivation. The sparse attention values are then used to reweight the feature distance of each teacher-student feature pair to avoid negative transfer. Comprehensive experiments demonstrate the effectiveness and generality of the approach. Moreover, SAKD outperforms previous state-of-the-art methods on image classification tasks.
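The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `sparse_attention_distill_loss`, the scaled dot-product scoring, the row renormalization after dropping weights, the mean-squared feature distance, and the assumption that student and teacher features have already been pooled and projected to a common dimension are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sparse_attention_distill_loss(student, teacher, drop_prob=0.5, rng=None):
    """Hypothetical sketch of an attention-reweighted feature distance.

    student: (n_s, d) array, one pooled feature vector per student layer.
    teacher: (n_t, d) array, one pooled feature vector per teacher layer.
    Both are assumed to share the dimension d (an assumption of this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = student.shape[1]

    # Student features act as queries, teacher features as keys.
    logits = student @ teacher.T / np.sqrt(d)
    attn = softmax(logits, axis=1)  # (n_s, n_t), each row sums to 1

    # "Random deactivation": drop each attention weight with probability
    # drop_prob, then renormalize the surviving weights in each row
    # (renormalization is an assumption). Fully dropped rows stay zero.
    keep = rng.random(attn.shape) >= drop_prob
    sparse = attn * keep
    row_sum = sparse.sum(axis=1, keepdims=True)
    sparse = np.divide(sparse, row_sum,
                       out=np.zeros_like(sparse), where=row_sum > 0)

    # Mean-squared distance for every teacher-student feature pair.
    diff = student[:, None, :] - teacher[None, :, :]
    dist = (diff ** 2).mean(axis=-1)  # (n_s, n_t)

    # Reweight the pairwise distances by the sparse attention values.
    loss = (sparse * dist).sum() / student.shape[0]
    return loss, sparse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.normal(size=(3, 8))   # 3 student layers, 8-dim features
    t = rng.normal(size=(4, 8))   # 4 teacher layers, 8-dim features
    loss, weights = sparse_attention_distill_loss(s, t, drop_prob=0.5, rng=rng)
    print(weights.shape, loss >= 0)
```

In this sketch, pairs whose attention weight is dropped contribute nothing to the loss for that step, which is one plausible reading of how sparsification limits the influence of poorly matched teacher-student pairs.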
computer science, artificial intelligence, theory & methods, engineering, electrical & electronic, software engineering, optics