ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

Jiawei Fan, Chao Li, Xiaolong Liu, Anbang Yao
2024-11-11
Abstract: In this paper, we question if well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross architecture knowledge distillation (KD) research, in the context of using large-scale datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components namely cross attention projector, dual-view feature mimicking and teacher parameter perception tailored to address the above problems, we present a simple and effective KD method, called ScaleKD. Our method can train student backbones that span across a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method gets 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains to the individually trained counterparts. Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models. The student backbones trained by our method transfer well on downstream MS-COCO and ADE20K datasets. More importantly, our method could be used as a more efficient alternative to the time-intensive pre-training paradigm for any target student model if a strong pre-trained ViT is available, reducing the amount of viewed training samples up to 195x.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper asks how a well pre-trained Vision Transformer (ViT) can serve as a teacher that transfers the scalable properties of large ViT models to student models with different architectures (CNN, MLP, and heterogeneous ViT structures), so as to improve cross-architecture knowledge distillation. Specifically, the authors focus on three kinds of differences:

1. **Differences in feature computing paradigms**:
   - ViT operates on a sequence of equally sized image patches with position embeddings, while CNN operates on a regular pixel grid.
   - ViT relies on the self-attention mechanism to model global feature dependencies, while CNN relies on convolution operations to model local features.
   - MLP uses fully connected operations instead of self-attention and does not use position embeddings, which leaves it with relatively weak feature-learning ability.
2. **Differences in model scales**:
   - At the micro level, the network widths, depths, and building blocks of ViT, CNN, and MLP differ.
   - At the macro level, the architectures differ in how well they can scale up model size to improve performance and generalization.
3. **Differences in knowledge densities**:
   - When pre-trained on larger datasets, large-scale ViT models usually outperform top-performing CNN and MLP models.
   - When only a well pre-trained ViT teacher is available, the knowledge density of the student model on the upstream image classification dataset (such as ImageNet-1K) differs from that of the teacher model.

To resolve these differences, the authors propose a simple and effective cross-architecture knowledge distillation method named ScaleKD, which aligns them through three closely coupled components:

1. **Cross-Attention Projector (CAP)**:
   - Converts the semantic units of CNN and MLP into ViT-like tokens.
   - Uses cross-attention operations and trainable queries to model the global dependencies of student features.
2. **Dual-view Feature Mimicking (DFM)**:
   - Performs feature mimicking in both the original feature space and the frequency space, capturing the alternative features that existing KD methods ignore.
3. **Teacher Parameter Perception (TPP)**:
   - Establishes a proxy feature processing path by connecting the early stages of the student model to the later stages of the teacher model, so that the student's parameter space gradually aligns with the teacher's.

With these designs, ScaleKD handles cross-architecture knowledge distillation effectively and shows significant performance improvements. Experimental results show that ScaleKD not only helps students inherit the scalability of the teacher, but also lets the student model reach performance comparable to that of a pre-trained model without using the pre-training data, providing a more efficient alternative to pre-training.

### Formula summary

- **Loss function of CAP**: \[ L_{\text{CAP}}=\alpha \| F_t - f_p(F_s; q) \|_2^2 \] where \( F_t \) and \( F_s \) are the teacher and student features, \( f_p \) is CAP, \( q \) is a trainable query, \( \alpha\geq0 \) is the loss weight, and \( \|\cdot\|_2 \) is the L2 norm.
- **Loss function of DFM**: \[ L_{\text{DFM}}=\beta L_{\text{ori}}+(1 - \beta) L_{\text{alt}} \] where \( L_{\text{ori}} \) and \( L_{\text{alt}} \) are the mimicking losses in the original and alternative (frequency) feature spaces, and \( \beta\in[0,1] \) is the balance weight.
- **Loss function of TPP**: \[ L_{\text{TPP}}=L_s + L_{st} \]
- **Total loss function of ScaleKD**: \[ L_{\text{ScaleKD}}
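
As a rough illustration of how the CAP and DFM losses in the formula summary combine their terms, the following minimal Python sketch evaluates them on toy feature vectors. It is not the authors' implementation. The function names are hypothetical, the projected student feature \( f_p(F_s; q) \) is assumed to be precomputed, and the "alternative view" is assumed to be the feature with its zero-frequency (mean) component removed, which this summary does not specify.

```python
from typing import List

Vector = List[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared L2 distance between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def remove_mean(v: Vector) -> Vector:
    """Assumed alternative view: drop the zero-frequency (mean) component."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def cap_loss(teacher: Vector, projected_student: Vector, alpha: float) -> float:
    """L_CAP = alpha * ||F_t - f_p(F_s; q)||_2^2, with f_p(F_s; q) given."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return alpha * squared_l2(teacher, projected_student)


def dfm_loss(teacher: Vector, projected_student: Vector, beta: float) -> float:
    """L_DFM = beta * L_ori + (1 - beta) * L_alt."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # Mimicking loss in the original feature space.
    l_ori = squared_l2(teacher, projected_student)
    # Mimicking loss in the assumed alternative (mean-removed) view.
    l_alt = squared_l2(remove_mean(teacher), remove_mean(projected_student))
    return beta * l_ori + (1.0 - beta) * l_alt


# Toy features; real features would be token tensors from the two networks.
teacher_feat = [1.0, 2.0, 3.0]
student_feat = [2.0, 2.0, 5.0]

print(cap_loss(teacher_feat, student_feat, alpha=2.0))  # 10.0
print(dfm_loss(teacher_feat, student_feat, beta=0.5))   # 3.5
```

Setting `beta=1.0` reduces the DFM loss to plain feature mimicking in the original space, while `beta=0.0` uses only the alternative view, so the weight controls how much the shared mean component contributes to the loss.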