AMD: Automatic Multi-step Distillation of Large-scale Vision Models

Cheng Han, Qifan Wang, Sohail A. Dianat, Majid Rabbani, Raghuveer M. Rao, Yi Fang, Qiang Guan, Lifu Huang, Dongfang Liu
2024-07-05
Abstract: Transformer-based architectures have become the de facto standard models for diverse vision tasks owing to their superior performance. As the size of the models continues to scale up, model distillation becomes extremely important in various real applications, particularly on devices limited by computational resources. However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g., a 10x compression rate. In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, our distillation process unfolds across multiple steps. Initially, the teacher undergoes distillation to form an intermediate teacher-assistant model, which is subsequently distilled further to the student. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance. We conduct extensive experiments on multiple image classification datasets, including CIFAR-10, CIFAR-100, and ImageNet. The findings consistently reveal that our approach outperforms several established baselines, paving a path for future knowledge distillation methods on large-scale vision models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a weakness of knowledge distillation in large-scale vision model compression: existing methods perform poorly when the capacity gap between the teacher model and the student model is large. It proposes Automatic Multi-step Distillation (AMD), which compresses large vision models through a multi-step distillation process and targets deployment on devices with limited computational resources.

### Main Issues:
1. **Knowledge Transfer Challenge with Large Capacity Gap**: When the capacity gap between the teacher and the student is very large (e.g., a 10x compression rate), the performance of existing knowledge distillation methods declines significantly.
2. **Efficient Selection of Intermediate Teacher Assistants**: In a multi-step distillation process, the intermediate teacher assistant that maximizes student performance must be identified efficiently.

### Solution:
- **Automatic Multi-step Distillation (AMD)**: The teacher model is first distilled into an intermediate teacher-assistant model, which is then distilled further into the student model. The process has three stages:
  1. **Structural Pruning**: Generate a series of teacher-assistant candidate models at different scales.
  2. **Joint Optimization**: Efficiently search for the optimal teacher assistant across all scales through a parameter-sharing, joint optimization framework.
  3. **Optimal Selection**: Select the optimal teacher assistant from all candidates based on the Negative Performance-Scale Derivative (NPSD) metric.
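
The selection stage can be illustrated with a minimal sketch. The summary does not state the NPSD formula, so the finite-difference form used here (the negative of the accuracy drop per unit of scale reduction, measured against the teacher) is an assumption, as is the rule of picking the candidate with the highest value. The `Candidate` class, the function names, and every number are hypothetical placeholders, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    scale: float     # model size, e.g. millions of parameters (placeholder unit)
    accuracy: float  # Top-1 accuracy in percent (placeholder values)


def npsd(teacher: Candidate, cand: Candidate) -> float:
    """ASSUMED finite-difference form of NPSD: -(accuracy drop) / (scale reduction)."""
    if cand.scale >= teacher.scale:
        raise ValueError("a teacher-assistant candidate must be smaller than the teacher")
    return -(teacher.accuracy - cand.accuracy) / (teacher.scale - cand.scale)


def select_assistant(teacher: Candidate, candidates: list[Candidate]) -> Candidate:
    """ASSUMED rule: keep the candidate that loses the least accuracy per unit of scale removed."""
    return max(candidates, key=lambda c: npsd(teacher, c))


# Illustrative placeholder numbers only.
teacher = Candidate("teacher", scale=86.0, accuracy=90.0)
candidates = [
    Candidate("ta-60", scale=60.0, accuracy=88.0),
    Candidate("ta-40", scale=40.0, accuracy=87.5),
    Candidate("ta-20", scale=20.0, accuracy=80.0),
]

best = select_assistant(teacher, candidates)
print(best.name)  # ta-40 under these placeholder numbers
```

Under this assumed form, a candidate scores well when a large reduction in scale costs little accuracy, which is why the mid-sized placeholder wins over both the larger and the smaller one.
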
### Experimental Results:
- The paper reports extensive experiments on multiple image classification datasets (CIFAR-10, CIFAR-100, and ImageNet), in which AMD outperforms existing single-step and multi-step knowledge distillation methods.
- In Top-1 accuracy, AMD achieves improvements of 1.79%-15.91%, 1.81%-24.25%, and 2.77%-4.03% on CIFAR-10, CIFAR-100, and ImageNet, respectively.
- AMD also trains much faster than other multi-step distillation baselines, with training speed improvements of up to 10x.

### Significance:
- The research provides an effective new method for knowledge distillation of large-scale vision models, which matters most when these models are deployed on devices with limited computational resources.
- Automatic multi-step distillation can significantly reduce a model's computational complexity and memory footprint while maintaining high performance, which supports wider adoption of large-scale vision models in practical applications.