Adaptive Cross-Architecture Mutual Knowledge Distillation

Jianyuan Ni, Hao Tang, Yuzhang Shang, Bin Duan, Yan
DOI: https://doi.org/10.1109/fg59268.2024.10581969
2024-01-01
Abstract: Knowledge distillation (KD), which transfers knowledge from a complex (teacher) network to a lightweight (student) network, has been actively studied recently. Although previous studies have proposed several advanced KD losses and intricate training strategies, the core concept of KD proves ineffective if the student model is too weak to mimic the teacher's performance. In this study, we aim to narrow the performance discrepancy between Transformer-based teacher and student models by incorporating the inductive biases of several heterogeneous student models. To this end, we put forward a novel cross-architecture knowledge distillation approach called Adaptive Cross-architecture Mutual Knowledge Distillation (ACMKD), which aims to mitigate the performance gap using a multi-student mutual learning strategy. Specifically, we use three mainstream models with different inductive biases (CNN, INN, and Transformer) as the student models. In addition, we propose an attention similarity mechanism that helps the student models mimic specific portions of the teacher model. Drawing inspiration from the Cannikin Law, we devise a second-stage KD process that dynamically enables the weakest student model to learn again from the stronger student models. We validate our proposed method on the ImageNet and CIFAR100 datasets, and the results confirm that ACMKD significantly narrows the performance gap compared to other KD methods.
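
The second-stage idea described in the abstract (the weakest student learns again from the stronger students) can be illustrated with a minimal sketch. The abstract does not give the loss or the selection rule, so everything here is an assumption made for illustration: the loss is the standard temperature-scaled soft-target KD loss rather than the paper's own objective, "weakest" is taken to mean lowest validation accuracy, the losses from the stronger peers are simply averaged, and all function names, logits, and accuracy values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Standard soft-target KD loss: T^2 * KL(teacher || student).

    Assumption: this is the generic KD loss, not the loss used in the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def weakest_student(val_accuracy):
    """Assumption: the weakest student is the one with the lowest validation accuracy."""
    return min(val_accuracy, key=val_accuracy.get)


def second_stage_loss(logits_by_student, val_accuracy, temperature=4.0):
    """Average KD loss of the weakest student against each stronger peer.

    Assumption: peers are weighted equally; the paper may weight them differently.
    """
    weakest = weakest_student(val_accuracy)
    peers = [name for name in logits_by_student if name != weakest]
    losses = [
        kd_loss(logits_by_student[weakest], logits_by_student[peer], temperature)
        for peer in peers
    ]
    return weakest, sum(losses) / len(losses)


# Hypothetical per-sample logits and validation accuracies for the three students.
logits = {
    "cnn": [2.0, 0.5, -1.0],
    "inn": [0.3, 0.2, 0.1],
    "transformer": [2.5, 0.1, -1.5],
}
accuracy = {"cnn": 0.70, "inn": 0.65, "transformer": 0.72}

name, loss = second_stage_loss(logits, accuracy)
```

With these made-up numbers the INN student has the lowest accuracy, so it is the one that receives the extra distillation signal from the CNN and Transformer students. In a real training loop the selection would be repeated periodically, so that the role of the weakest student can change as training progresses.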