Abstract: In this paper, we ask whether well pre-trained vision transformer (ViT) models can serve as teachers that exhibit scalable properties, advancing cross-architecture knowledge distillation (KD) research when large-scale datasets are used for evaluation. Our analysis shows that making this possible requires effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three coupled components tailored to these problems, namely a cross attention projector, dual-view feature mimicking, and teacher parameter perception, we present a simple and effective KD method called ScaleKD. Our method can train student backbones spanning a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art distillation performance. For instance, with a well pre-trained Swin-L as the teacher model, our method reaches 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracy for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained from scratch on the ImageNet-1K dataset, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains over their individually trained counterparts. Intriguingly, when the teacher models or their pre-training datasets are scaled up, our method shows the desired scalable properties, bringing increasingly larger gains to student models. The student backbones trained by our method transfer well to the downstream MS-COCO and ADE20K datasets. More importantly, when a strong pre-trained ViT is available, our method can serve as a more efficient alternative to the time-intensive pre-training paradigm for any target student model, reducing the number of viewed training samples by up to 195x.
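
The pipe-delimited results above are easier to read when each student is paired with its own numbers. The short Python sketch below does only that arithmetic: it copies the accuracies and gains from the abstract and subtracts the gain from the distilled accuracy to get the baseline each pair implies. The baselines are derived here, not quoted from the paper, and the function name `implied_baselines` is purely illustrative.

```python
# Values copied verbatim from the abstract (Swin-L teacher, ImageNet-1K).
students = ("MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|"
            "Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16").split("|")
distilled = [float(x) for x in
             "75.15|82.03|84.16|78.63|81.96|83.93|83.80|85.53".split("|")]
gains = [float(x) for x in
         "3.05|3.39|2.02|4.61|5.52|4.03|2.62|3.73".split("|")]


def implied_baselines(names, distilled_acc, abs_gain):
    """Map each student to (distilled accuracy - absolute gain).

    This is plain subtraction on the reported numbers; the result is the
    baseline implied by the abstract, not a value quoted from the paper.
    """
    if not (len(names) == len(distilled_acc) == len(abs_gain)):
        raise ValueError("all three lists must have the same length")
    return {n: round(d - g, 2)
            for n, d, g in zip(names, distilled_acc, abs_gain)}


baselines = implied_baselines(students, distilled, gains)

# Print one row per student: implied baseline, distilled accuracy, gain.
for name, acc, gain in zip(students, distilled, gains):
    print(f"{name:<13} {baselines[name]:6.2f} -> {acc:6.2f}  (+{gain:.2f})")
```

Read this way, the largest reported gain (5.52 points) belongs to Mixer-B/16 and the smallest (2.02 points) to ConvNeXt-T.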

Knowledge distilled pre-training model for vision-language-navigation

Real-time Vision-Language-Navigation based on a Lite Pre-training Model

MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation

Research on Knowledge Distillation Algorithm of Object Detection

Layerwised multimodal knowledge distillation for vision-language pretrained model

Knowledge Distillation of Transformer-based Language Models Revisited

Enabling Vision-and-Language Navigation for Intelligent Connected Vehicles Using Large Pre-Trained Models

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Depth-Aware Vision-and-Language Navigation Using Scene Query Attention Network

Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models

Vision-and-Language Navigation Generative Pretrained Transformer

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition

Online Robot Navigation and Manipulation with Distilled Vision-Language Models

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Lightweight Model Pre-training via Language Guided Knowledge Distillation

Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

DLIP: Distilling Language-Image Pre-training

MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation