MIM-HD: Making Smaller Masked Autoencoder Better with Efficient Distillation

Zherui Zhang, Changwei Wang, Rongtao Xu, Wenhao Xu, Shibiao Xu, Li Guo, Jiguang Zhang, Xiaoqiang Teng, Wenbo Xu
DOI: https://doi.org/10.3233/faia240493
2024-01-01
Abstract: Combining self-supervised learning with knowledge distillation yields strong downstream performance across networks of diverse capacities. This paper introduces MIM-HD, which improves masked image modeling (MIM) distillation in two key aspects. First, it proposes a head-level relation adaptive distillation approach for vision transformers, which lets the student dynamically draw multi-source knowledge from the teacher based on its own evolving state and remains applicable when teacher and student transformer blocks have different head counts. Second, previous MIM distillation methods overemphasize the encoder and neglect the decoder's role in maintaining representation consistency; to address this, MIM-HD introduces a dual-view decoding strategy for latent visual representations that reuses the teacher’s decoder to alleviate the MIM burden on smaller networks. Evaluations on ADE20K (mIoU) and ImageNet-1K (Acc) show improvements of +1.4% and +0.5%, respectively, over state-of-the-art methods, with substantial advantages on smaller pre-training datasets. MIM-HD is also more efficient, reducing pre-training from 300 epochs to 100.
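As a rough illustration of the first idea only, the sketch below shows one way a head-level relation loss could work when the student and teacher have different numbers of attention heads. It is not the paper's formulation: the function names, the use of row-wise KL divergence between relation maps, and the softmax weighting with a `temperature` parameter are all assumptions made for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def relation_distance(student_rel, teacher_rel):
    """Mean row-wise KL(teacher || student) between two N x N relation maps.

    Each map is assumed to be row-stochastic (e.g. softmax attention scores).
    """
    rows = zip(teacher_rel, student_rel)
    return sum(kl(t_row, s_row) for t_row, s_row in rows) / len(student_rel)

def head_adaptive_loss(student_heads, teacher_heads, temperature=1.0):
    """Hypothetical head-level adaptive loss (assumed form, not the paper's).

    For every student head, measure its distance to every teacher head, then
    weight the teacher heads by a softmax over negative distances. The weights
    are recomputed from the student's current relation maps, so they change as
    the student evolves, and nothing requires equal head counts. In a real
    training loop the weights would normally be excluded from gradients.
    """
    total = 0.0
    for s_rel in student_heads:
        dists = [relation_distance(s_rel, t_rel) for t_rel in teacher_heads]
        weights = softmax([-d / temperature for d in dists])
        total += sum(w * d for w, d in zip(weights, dists))
    return total / len(student_heads)

# Toy example: 3 teacher heads, 2 student heads, 2 tokens per relation map.
teacher = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.1, 0.9], [0.7, 0.3]],
]
student = [
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.4, 0.6], [0.6, 0.4]],
]
loss = head_adaptive_loss(student, teacher)
```

Because each student head is compared against all teacher heads and the weights come from a softmax, the sketch needs no fixed one-to-one head mapping; this is one simple way to accommodate mismatched head counts, chosen here for brevity.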