Generic-to-Specific Distillation of Masked Autoencoders
Wei Huang, Zhiliang Peng, Li Dong, Furu Wei, Qixiang Ye, Jianbin Jiao
DOI: https://doi.org/10.1109/tcsvt.2024.3393474
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Knowledge distillation has been widely explored as a way to transfer the representation capacity of large pre-trained models to lightweight models. However, conventional single-stage distillation methods tend to get stuck transferring task-specific knowledge, which makes it difficult to retain the task-agnostic knowledge that is crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to boost lightweight models with the assistance of large models pre-trained by masked image modeling. In generic distillation, the decoder of a small model is encouraged to align its feature predictions with those of a large model, so that task-agnostic knowledge can be transferred. In specific distillation, the predictions of the small model are encouraged to be consistent with those of the large model, to guarantee task performance. G2SD is also applicable to heterogeneous settings (i.e., distilling from ViT to CNN). With G2SD, the ViT-Small model achieves 98.9%, 98.4%, 99.3% and 98.9% of the accuracy of its ViT-Base teachers on image classification, object detection, semantic segmentation and video recognition, respectively. The lightweight ResNet models also reach a new level of accuracy on the image classification task. The code is available at github.com/pengzhiliang/G2SD.
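The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the loss functions, so mean squared error (for feature alignment) and temperature-softened KL divergence (for prediction consistency) are assumptions chosen for illustration, and the function names `feature_alignment_loss` and `prediction_consistency_loss` are hypothetical.

```python
import math


def feature_alignment_loss(student_feats, teacher_feats):
    """Generic stage (sketch): mean squared error between the student
    decoder's feature predictions and the teacher's features.
    MSE is an assumed stand-in; the abstract does not specify the loss."""
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student_feats, teacher_feats):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def prediction_consistency_loss(student_logits, teacher_logits, temperature=1.0):
    """Specific stage (sketch): KL(teacher || student) between softened
    class distributions. KL divergence is an assumed stand-in."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    return sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0)


# Toy values: one patch with two feature dimensions, and three-class logits.
generic_loss = feature_alignment_loss([[0.0, 0.0]], [[1.0, 1.0]])
specific_loss = prediction_consistency_loss([2.0, 0.5, 0.1], [1.5, 0.7, 0.2])
```

In this sketch the generic loss would be minimized first on unlabeled images, and the specific loss afterwards on the downstream task, mirroring the generic-to-specific ordering the abstract describes.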
engineering, electrical & electronic