Abstract: In this paper, we ask whether well pre-trained vision transformer (ViT) models can serve as teachers that exhibit scalable properties, in order to advance cross-architecture knowledge distillation (KD) research evaluated on large-scale datasets. Our analysis underlines that making this possible requires effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components tailored to these problems, namely a cross attention projector, dual-view feature mimicking, and teacher parameter perception, we present a simple and effective KD method called ScaleKD. Our method can train student backbones spanning a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method gets 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on the ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains over the individually trained counterparts. Intriguingly, when the size of the teacher models or of their pre-training datasets is scaled up, our method shows the desired scalable properties, bringing increasingly larger gains to student models. The student backbones trained by our method transfer well to the downstream MS-COCO and ADE20K datasets. More importantly, our method can be used as a more efficient alternative to the time-intensive pre-training paradigm for any target student model if a strong pre-trained ViT is available, reducing the number of viewed training samples by up to 195x.
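The pipe-separated figures above pair up position by position, which is easy to misread. The short Python sketch below simply tabulates them and derives the baseline accuracy implied by each pair (distilled accuracy minus absolute gain). The derived baselines are arithmetic on the numbers quoted in the abstract, not values stated by the authors, and the variable names are illustrative only.

```python
# Top-1 accuracy (%) with ScaleKD and absolute gain (%) over the individually
# trained counterpart, copied position by position from the abstract.
reported = {
    "MobileNet-V1": (75.15, 3.05),
    "ResNet-50": (82.03, 3.39),
    "ConvNeXt-T": (84.16, 2.02),
    "Mixer-S/16": (78.63, 4.61),
    "Mixer-B/16": (81.96, 5.52),
    "ViT-S/16": (83.93, 4.03),
    "Swin-T": (83.80, 2.62),
    "ViT-B/16": (85.53, 3.73),
}

# Implied baseline = distilled accuracy - absolute gain.
# Assumption: this is derived here, not quoted from the paper.
baselines = {
    name: round(acc - gain, 2) for name, (acc, gain) in reported.items()
}

# Student with the largest reported absolute gain.
largest_gain_model = max(reported, key=lambda name: reported[name][1])

for name, (acc, gain) in reported.items():
    print(
        f"{name:<13} baseline {baselines[name]:.2f}"
        f" -> distilled {acc:.2f} (+{gain:.2f})"
    )
print("largest gain:", largest_gain_model)
```

Read this way, the reported gains range from 2.02 points (ConvNeXt-T) to 5.52 points (Mixer-B/16).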

Improved ViT via knowledge distillation on small datasets

Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation

Training data-efficient image transformers & distillation through attention

OVO: One-shot Vision Transformer Search with Online distillation

Improving Vision Transformers for Incremental Learning

Add-Vit: CNN-Transformer Hybrid Architecture for Small Data Paradigm Processing

ViTKD: Feature-based Knowledge Distillation for Vision Transformers

ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Bi-ViT: Pushing the Limit of Vision Transformer Quantization

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Patient teacher can impart locality to improve lightweight vision transformer on small dataset

Boosting Vanilla Lightweight Vision Transformers Via Re-parameterization

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition

AMD: Automatic Multi-step Distillation of Large-scale Vision Models

Knowledge Distillation via the Target-aware Transformer

Super Vision Transformer

Co-advise: Cross Inductive Bias Distillation

DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

ViR: the Vision Reservoir