MS-DETR: Efficient DETR Training with Mixed Supervision

Chuyang Zhao, Yifan Sun, Wenhao Wang, Qiang Chen, Errui Ding, Yi Yang, Jingdong Wang
2024-01-09
Abstract: DETR accomplishes end-to-end object detection through iteratively generating multiple object candidates based on image features and promoting one candidate for each ground-truth object. The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper "MS-DETR: Efficient DETR Training with Mixed Supervision" addresses the lack of direct supervision on the multiple object candidates that DETR (Detection Transformer) generates during training. Traditional DETR training relies mainly on one-to-one supervision, in which each ground-truth object is matched to exactly one predicted candidate box. Although effective, this scheme does not directly supervise the generation of the other candidates, which may leave the candidate boxes at suboptimal quality.

To improve the training efficiency and detection performance of DETR, the authors propose MS-DETR (Mixed Supervision DETR). The method explicitly supervises candidate generation by mixing one-to-one and one-to-many supervision. Specifically, MS-DETR applies one-to-many supervision to the object queries of the main decoder, without requiring additional decoder branches or queries. This design improves the quality of the candidate boxes while keeping the model simple and leaving the inference process unchanged.

### Main Contributions

1. **Improved Candidate Box Quality**: By introducing one-to-many supervision, MS-DETR generates better candidate boxes, which improves detection performance.
2. **Enhanced Training Efficiency**: Experimental results show that MS-DETR converges faster during training and outperforms other DETR variants given the same number of training epochs.
3. **Model Simplicity**: Unlike existing DETR variants, MS-DETR does not require additional decoder branches or queries, so it compares favorably in computational and memory efficiency.
4. **Complementarity**: MS-DETR can be combined with other DETR variants (such as Group DETR and Hybrid DETR) to further enhance performance.
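
As a rough illustration of how one-to-one and one-to-many supervision label the same set of object queries differently, the Python sketch below builds both sets of training targets from a small matching-score matrix. Everything in it is an assumption made for illustration rather than the paper's implementation: the function names, the `score[q][g]` matrix (a hypothetical matching quality between query `q` and ground-truth object `g`), and the `top_k` and `threshold` parameters. A brute-force search stands in for Hungarian matching, which is only practical for tiny examples.

```python
from itertools import permutations


def one_to_one_assign(score):
    """Match each ground-truth object to exactly one distinct query.

    score[q][g] is a hypothetical matching quality (higher is better).
    Brute force over all assignments; a stand-in for Hungarian matching.
    Assumes there are at least as many queries as ground-truth objects.
    """
    num_q, num_g = len(score), len(score[0])
    best, best_total = None, float("-inf")
    for perm in permutations(range(num_q), num_g):
        total = sum(score[q][g] for g, q in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return {g: q for g, q in enumerate(best)}


def one_to_many_assign(score, top_k=2, threshold=0.4):
    """Match each ground-truth object to up to top_k queries.

    Keeps the top_k highest-scoring queries per object and drops any whose
    score is below threshold. Both values are illustrative assumptions.
    This sketch does not resolve a query selected by several objects.
    """
    num_q, num_g = len(score), len(score[0])
    assignment = {}
    for g in range(num_g):
        ranked = sorted(range(num_q), key=lambda q: score[q][g], reverse=True)
        assignment[g] = [q for q in ranked[:top_k] if score[q][g] >= threshold]
    return assignment


def mixed_supervision_targets(score, top_k=2, threshold=0.4):
    """Return per-query positive/negative targets under both schemes."""
    num_q = len(score)
    o2o_target = [0.0] * num_q
    for q in one_to_one_assign(score).values():
        o2o_target[q] = 1.0
    o2m_target = [0.0] * num_q
    for queries in one_to_many_assign(score, top_k, threshold).values():
        for q in queries:
            o2m_target[q] = 1.0
    return o2o_target, o2m_target


if __name__ == "__main__":
    # Five queries (rows) scored against two ground-truth objects (columns).
    score = [
        [0.9, 0.1],
        [0.8, 0.2],
        [0.1, 0.7],
        [0.3, 0.5],
        [0.2, 0.3],
    ]
    o2o_target, o2m_target = mixed_supervision_targets(score)
    print(o2o_target)  # [1.0, 0.0, 1.0, 0.0, 0.0]
    print(o2m_target)  # [1.0, 1.0, 1.0, 1.0, 0.0]
```

In this toy example, one-to-one supervision marks only queries 0 and 2 as positives, while one-to-many supervision also marks queries 1 and 3, so the near-miss candidates receive a direct training signal. In a mixed-supervision setup, the losses computed from the two target sets would then be combined, for example as a weighted sum, on the same decoder queries.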
### Experimental Results

Experimental results show that MS-DETR improves performance across several DETR baseline models. For example, with 12 training epochs, MS-DETR improves mAP by 3.7, 3.7, and 1.8 on DAB-Deformable-DETR, Deformable DETR, and Deformable DETR++, respectively. Additionally, MS-DETR can be combined with existing DETR variants to further enhance performance.

### Conclusion

By introducing a mixed supervision mechanism, MS-DETR addresses the lack of direct supervision in DETR's candidate box generation process, thereby improving the training efficiency and detection performance of the model. The method performs well on various DETR baseline models and remains efficient in computation and memory, which makes it a practical improvement worth adopting.