Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yu-Xiong Wang, Liang-Yan Gui
DOI: https://doi.org/10.48550/arXiv.2308.09105
2023-08-18
Abstract: Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage. While knowledge distillation is a proven strategy to enhance the performance of lightweight classification models, its application to structured outputs like object detection and instance segmentation remains a complicated task, due to the variability in outputs and complex internal network modules involved in the distillation process. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student. To distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers to help the student gradually adapt. Our progressive strategy can be easily combined with existing detection distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, and unprecedentedly boost the performance of ResNet-50 based RetinaNet from 36.5% to 42.0% AP and Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to reduce the computation and memory footprint of vision models while preserving high accuracy in resource-constrained perception systems, such as edge computing and robotic vision. Specifically, it focuses on using knowledge distillation to transfer knowledge from complex teacher models to lightweight student models, especially for object detection and instance segmentation. Because these tasks produce structured outputs and involve complex internal network modules, conventional knowledge distillation methods are hard to apply directly. The paper therefore proposes Multi-Teacher Progressive Distillation (MTPD), which aims to overcome the limitations of existing distillation methods on structured-output tasks, particularly when teacher and student differ substantially in architecture, as in knowledge transfer from Transformer-based teachers to convolution-based students.

The main contributions of the paper are:

1. A framework for learning lightweight detectors through multi-teacher progressive distillation. The method is simple, effective, and general: it automatically designs a sequence of teachers suited to a given student and performs distillation step by step.
2. MTPD is a meta-level strategy that can easily be combined with previous detection distillation methods. A comprehensive empirical evaluation on the MS COCO dataset shows consistent performance improvements regardless of the complexity of the distillation loss.
3. Lightweight RetinaNet and Mask R-CNN models trained with MTPD achieve state-of-the-art accuracy in various settings, especially with heterogeneous backbones and input resolutions. In particular, heterogeneous distillation from Transformer-based teacher detectors to convolution-based students is studied for the first time, and progressive distillation is found to be the key to bridging the gap between them.
4. Empirical analysis shows that the performance improvement comes mainly from better generalization rather than better optimization. The knowledge transferred from multiple teachers guides the student toward a flatter minimum, which helps it generalize better.

Together, these contributions give a way to train high-performing lightweight models efficiently for resource-constrained environments, with edge devices and real-time applications as the intended use cases.
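The step-by-step procedure described above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the paper's implementation: the student is a one-dimensional linear model rather than a detector, the teachers are plain functions, the teacher order is supplied by hand (the paper designs it automatically, which is not reproduced here), and the names `distill_step` and `progressive_distill` are hypothetical. The distillation loss is a simple mean squared error between student and teacher outputs, standing in for whatever detection distillation loss is plugged in.

```python
from typing import Callable, Sequence, Tuple

Teacher = Callable[[float], float]   # toy teacher: maps an input to an output
Student = Tuple[float, float]        # toy student: (weight, bias) of y = w * x + b


def distill_step(student: Student, teacher: Teacher, inputs: Sequence[float],
                 lr: float = 0.1, epochs: int = 200) -> Student:
    """Train the student to mimic one teacher with gradient descent on an MSE loss."""
    w, b = student
    n = len(inputs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x in inputs:
            err = (w * x + b) - teacher(x)   # student output minus teacher output
            grad_w += 2.0 * err * x / n
            grad_b += 2.0 * err / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def progressive_distill(student: Student, teachers: Sequence[Teacher],
                        inputs: Sequence[float]) -> Student:
    """Distill from each teacher in turn, starting each stage from the previous result."""
    for teacher in teachers:
        student = distill_step(student, teacher, inputs)
    return student


# Hypothetical teacher sequence, ordered by hand from "closest to the student"
# to "final target". In the paper this ordering is chosen automatically.
teachers = [
    lambda x: 1.0 * x,
    lambda x: 2.0 * x + 0.5,
    lambda x: 3.0 * x + 1.0,
]
inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]

final_w, final_b = progressive_distill((0.0, 0.0), teachers, inputs)
```

In this convex toy problem the student would reach the same solution by distilling from the last teacher alone, so the sketch shows only the control flow: each stage starts from the weights produced by the previous one. The benefit reported in the paper concerns non-convex detector training, where the intermediate teachers are said to guide the student toward a flatter minimum.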