OD-DETR: Online Distillation for Stabilizing Training of Detection Transformer

Shengjian Wu, Li Sun, Qingli Li
2024-06-09
Abstract: DEtection TRansformer (DETR) has become a dominant paradigm, mainly due to its common architecture with high accuracy and no post-processing. However, DETR suffers from unstable training dynamics. It consumes more data and epochs to converge compared with CNN-based detectors. This paper aims to stabilize DETR training through online distillation. It utilizes a teacher model, accumulated by Exponential Moving Average (EMA), and distills its knowledge into the online model in the following three aspects. First, the matching relation between object queries and ground truth (GT) boxes in the teacher is employed to guide the student, so queries within the student are not only assigned labels based on their own predictions, but also refer to the matching results from the teacher. Second, the teacher's initial query is given to the online student, and its prediction is directly constrained by the corresponding output from the teacher. Finally, the object queries from the teacher's different decoding stages are used to build the auxiliary groups to accelerate the convergence. For each GT, two queries with the least matching costs are selected into this extra group, and they predict the GT box and participate in the optimization. Extensive experiments show that the proposed OD-DETR successfully stabilizes the training, and significantly increases the performance without bringing in more parameters.
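The EMA teacher mentioned in the abstract can be illustrated with a minimal sketch. The function name `ema_update`, the decay value, and the plain-dict stand-in for model parameters are assumptions made for illustration; they are not taken from the paper.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter toward the student by an exponential moving average.

    teacher, student: dicts mapping parameter names to floats
                      (a hypothetical stand-in for real weight tensors).
    decay: assumed smoothing factor; the paper's setting is not stated here.
    """
    for name, s_value in student.items():
        # The teacher keeps `decay` of its old value and takes the rest from the student.
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s_value
    return teacher


# Toy usage: one update step with a deliberately small decay so the change is visible.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
ema_update(teacher, student, decay=0.9)
```

Because the teacher is only an average of past student weights, it is never trained by gradient descent and adds no learnable parameters.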
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the instability and slow convergence of DETR (DEtection TRansformer) training. Although DETR offers high accuracy and needs no post-processing, its training dynamics are unstable, and it requires more data and more training epochs to converge than CNN-based detectors. To address this, the paper proposes an online distillation method (OD-DETR), which stabilizes DETR training by introducing an exponential moving average (EMA) teacher model.

### Main Solutions

1. **Matching Distillation**:
   - Use the matching relationship between the teacher model's queries and the ground-truth (GT) boxes to guide the student model.
   - The queries in the student model are assigned labels not only according to their own predictions but also with reference to the matching results from the teacher model.
   - Solve the assignment over the matching cost matrix with the Hungarian algorithm, and use the multi-target QFL loss (multi-target Quality Focal Loss) to handle multiple possible matching ground-truth boxes while keeping the regression targets consistent.
2. **Prediction Distillation**:
   - Pass the initial queries of the teacher model to the online student model, so that its predictions are directly constrained by the outputs of the teacher model.
   - Improve the prediction distillation loss \(L_{pd}\) by adjusting the prediction score vector \(c'\) so that the student model learns more stably.
3. **Auxiliary Group**:
   - Use high-quality queries from different decoding stages of the teacher model to construct an auxiliary group that accelerates convergence.
   - For each ground-truth box, select the two queries with the lowest matching cost to join the auxiliary group and participate in the optimization process.
   - The predictions in the auxiliary group also undergo matching distillation, combining the original matching results with the new matching results to enhance training stability.

### Experimental Results

The paper verifies the effectiveness of OD-DETR through extensive experiments on the MS-COCO dataset. The results show that OD-DETR significantly improves performance, and the gains carry over to different DETR variants. For example, under the 2x training schedule, OD-DETR reaches 47.7 AP, which is 2.3 AP points higher than the baseline model (Def-DETR with iterative bounding box refinement). In addition, OD-DAB-DETR and OD-DINO also achieve significant performance improvements over their respective baselines.

### Summary

Through online distillation, the proposed OD-DETR addresses the instability and slow convergence of DETR training and significantly improves detection performance without requiring additional parameters.
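
The auxiliary-group selection rule described under Main Solutions (for each GT box, keep the two teacher queries with the lowest matching cost) can be sketched as follows. The cost-matrix layout (rows are queries, columns are GT boxes), the function name `select_auxiliary_queries`, and the toy cost values are assumptions for illustration; how the paper computes the matching cost is not reproduced here.

```python
def select_auxiliary_queries(cost, k=2):
    """For each GT box, return the indices of the k queries with the lowest cost.

    cost: list of rows, where cost[q][g] is the (hypothetical) matching cost
          between query q and GT box g.
    k:    number of queries kept per GT box; the summary above uses two.
    """
    num_queries = len(cost)
    num_gt = len(cost[0]) if cost else 0
    selected = {}
    for g in range(num_gt):
        # Rank all queries by their cost against this GT box, cheapest first.
        ranked = sorted(range(num_queries), key=lambda q: cost[q][g])
        selected[g] = ranked[:k]
    return selected


# Toy example: 4 teacher queries, 2 GT boxes.
toy_cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.3, 0.7],
    [0.6, 0.4],
]
aux = select_auxiliary_queries(toy_cost)
```

In this toy case, queries 1 and 2 are chosen for the first GT box and queries 0 and 3 for the second. The sketch ranks each GT box independently, so with other cost values the same query could be chosen for more than one box; whether and how the paper resolves such overlaps is not covered by the summary.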