Abstract: Transformer-based architectures have become the de-facto standard models for diverse vision tasks owing to their superior performance. As the size of the models continues to scale up, model distillation becomes extremely important in various real applications, particularly on devices limited by computational resources. However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g., a 10x compression rate. In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, our distillation process unfolds across multiple steps. Initially, the teacher undergoes distillation to form an intermediate teacher-assistant model, which is subsequently distilled further to the student. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance. We conduct extensive experiments on multiple image classification datasets, including CIFAR-10, CIFAR-100, and ImageNet. The findings consistently reveal that our approach outperforms several established baselines, paving a path for future knowledge distillation methods on large-scale vision models.
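
The two-step chain described in the abstract (teacher to teacher-assistant, then teacher-assistant to student) can be illustrated with the standard temperature-softened distillation loss. This is a minimal sketch, not the paper's objective: the loss form (KL divergence on softened outputs, scaled by T^2), the temperature, and all logit values are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumed loss form (classic soft-target distillation); the paper's
    actual objective may differ.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical logits for one image over three classes.
teacher_logits = [6.0, 2.0, -1.0]
assistant_logits = [4.5, 1.8, -0.5]
student_logits = [2.0, 1.5, 0.0]

# Step 1: the assistant is trained to match the teacher.
step1 = kd_loss(assistant_logits, teacher_logits)
# Step 2: the student is trained to match the assistant.
step2 = kd_loss(student_logits, assistant_logits)
# Single-step baseline: the student matches the teacher directly.
direct = kd_loss(student_logits, teacher_logits)

print(f"teacher->assistant: {step1:.4f}")
print(f"assistant->student: {step2:.4f}")
print(f"teacher->student:   {direct:.4f}")
```

With these made-up logits, each hop in the chain starts from a smaller mismatch than the direct teacher-to-student hop, which is the intuition behind inserting an intermediate model.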

What problem does this paper attempt to address?

The paper addresses a weakness of knowledge distillation in large-scale vision model compression: traditional methods perform poorly when there is a significant capacity gap between the teacher model and the student model. Specifically, the paper proposes a new method called Automatic Multi-step Distillation (AMD), which compresses large vision models through a multi-step distillation process and targets applications on devices with limited computational resources.

### Main Issues:

1. **Knowledge Transfer Challenge with Large Capacity Gap**: When there is a very large capacity gap between the teacher model and the student model (e.g., a 10x compression rate), the performance of existing knowledge distillation methods declines significantly.
2. **Efficient Selection of Intermediate Teacher Assistants**: A multi-step distillation process needs an efficient way to select the intermediate teacher assistant that maximizes the performance of the student model.

### Solution:

- **Automatic Multi-step Distillation (AMD)**: This method involves a multi-step distillation process where the teacher model is first distilled into an intermediate teacher assistant model, which is then further distilled into the student model. The entire process includes three stages:
  1. **Structural Pruning**: Generate a series of teacher assistant candidate models of different scales.
  2. **Joint Optimization**: Efficiently find the optimal teacher assistant at all scales through a parameter-sharing and joint optimization framework.
  3. **Optimal Selection**: Select the optimal teacher assistant from all candidates based on the Negative Performance-Scale Derivative (NPSD) metric.
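
The selection stage can be sketched as follows. The summary does not define NPSD, so this example assumes one plausible reading: the negative finite-difference slope of performance with respect to scale between adjacent candidates, with the highest score winning. The function names, the scoring form, the selection rule, and the candidate numbers are all hypothetical; the paper's exact definition may differ.

```python
def npsd_scores(candidates):
    """Finite-difference stand-in for a negative performance-scale derivative.

    candidates: list of (scale, performance) pairs.
    Returns {scale: score} for every candidate that has a larger neighbour.
    ASSUMPTION: this scoring form is illustrative, not the paper's formula.
    """
    ordered = sorted(candidates)  # ascending scale
    scores = {}
    for (s_lo, p_lo), (s_hi, p_hi) in zip(ordered, ordered[1:]):
        # Assumed form: -(dP/dS), estimated between adjacent candidates.
        scores[s_lo] = -(p_hi - p_lo) / (s_hi - s_lo)
    return scores

def select_assistant(candidates):
    """Return the scale of the candidate with the largest assumed score."""
    scores = npsd_scores(candidates)
    return max(scores, key=scores.get)

# Hypothetical candidates: (fraction of teacher parameters, top-1 accuracy).
candidates = [(0.1, 0.70), (0.2, 0.78), (0.4, 0.80), (0.6, 0.81), (0.8, 0.815)]

best = select_assistant(candidates)
print("selected assistant scale:", best)
```

Under this assumed form, the highest score goes to the candidate whose accuracy is least sensitive to a further increase in scale. Swapping in the paper's actual metric would only require changing the body of `npsd_scores`.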

### Experimental Results:

- The paper conducts extensive experiments on multiple image classification datasets (such as CIFAR-10, CIFAR-100, and ImageNet), showing that the AMD method significantly outperforms existing single-step and multi-step knowledge distillation methods in terms of performance.
- Specifically, in terms of Top-1 accuracy, the AMD method achieves improvements of 1.79%-15.91%, 1.81%-24.25%, and 2.77%-4.03% on CIFAR-10, CIFAR-100, and ImageNet, respectively.
- Additionally, the training speed of the AMD method is much faster than other multi-step distillation baseline methods, with training speed improvements of up to 10x.

### Significance:

- This research provides a new effective method for knowledge distillation of large-scale visual models, which is particularly significant for deploying these models on devices with limited computational resources.
- Through automatic multi-step distillation, it is possible to significantly reduce the computational complexity and memory footprint of the model while maintaining high performance, thereby promoting the widespread adoption of large-scale visual models in practical applications.

AMD: Automatic Multi-step Distillation of Large-scale Vision Models

DCCD: Reducing Neural Network Redundancy Via Distillation

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Self-Paced Knowledge Distillation for Real-Time Image Guided Depth Completion

Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation

Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

RSKD: Enhanced medical image segmentation via multi-layer, rank-sensitive knowledge distillation in Vision Transformer models

Improved Knowledge Distillation via Teacher Assistant

AMD: Adaptive Masked Distillation for Object Detection

Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

ReDistill: Residual Encoded Distillation for Peak Memory Reduction

ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices

ATOM: Attention Mixer for Efficient Dataset Distillation

Pixel Distillation: Cost-flexible Distillation Across Image Sizes and Heterogeneous Networks

On Good Practices for Task-Specific Distillation of Large Pretrained Visual Models

DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

Optimizing Vision Transformers with Data-Free Knowledge Transfer

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

When in Doubt, Summon the Titans: Efficient Inference with Large Models