What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper "MS-DETR: Efficient DETR Training with Mixed Supervision" addresses the lack of direct supervision for generating multiple object detection candidate boxes during DETR (Detection Transformer) training. Traditional DETR training relies mainly on one-to-one supervision, in which each ground-truth object is matched to exactly one predicted box. Although this is effective, it provides no direct supervision for the process that generates multiple candidate boxes, which may leave the candidates at suboptimal quality.

To improve DETR's training efficiency and detection performance, the authors propose MS-DETR (Mixed Supervision DETR). The method explicitly supervises candidate generation by mixing one-to-one supervision with one-to-many supervision. Specifically, MS-DETR applies one-to-many supervision to the object queries of the main decoder, without requiring additional decoder branches or queries. This design improves the quality of the candidate boxes while keeping the model simple and leaving the inference process unchanged.

### Main Contributions

1. **Improved Candidate Box Quality**: By introducing one-to-many supervision, MS-DETR generates better candidate boxes, which improves detection performance.
2. **Enhanced Training Efficiency**: Experimental results show that MS-DETR converges faster during training and outperforms other DETR variants given the same number of training epochs.
3. **Model Simplicity**: Unlike existing DETR variants, MS-DETR does not require additional decoder branches or queries, so it is more efficient in computation and memory.
4. **Complementarity**: MS-DETR can be combined with other DETR variants (such as Group DETR and Hybrid DETR) to further enhance performance.
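
The difference between the two kinds of supervision can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `scores` matrix, the top-`k` rule, and the `threshold` value are all assumptions chosen for illustration, and the brute-force search is only a stand-in for the Hungarian matching normally used in DETR. It shows one possible way that the same set of queries could receive both a one-to-one assignment and a one-to-many assignment.

```python
from itertools import permutations


def one_to_one_assign(score):
    """Match each ground-truth object to exactly one query.

    score[q][g] is a hypothetical matching quality of query q for object g
    (higher is better). Brute force over query choices; a stand-in for
    Hungarian matching that is only practical for tiny inputs.
    """
    num_q, num_g = len(score), len(score[0])
    best_total, best = float("-inf"), None
    for queries in permutations(range(num_q), num_g):
        total = sum(score[q][g] for g, q in enumerate(queries))
        if total > best_total:
            best_total, best = total, queries
    return {g: q for g, q in enumerate(best)}


def one_to_many_assign(score, k=2, threshold=0.5):
    """Match each ground-truth object to up to k queries (assumed rule).

    Keeps the k highest-scoring queries per object and drops any whose
    score falls below the threshold.
    """
    num_q, num_g = len(score), len(score[0])
    assign = {}
    for g in range(num_g):
        ranked = sorted(range(num_q), key=lambda q: score[q][g], reverse=True)
        assign[g] = [q for q in ranked[:k] if score[q][g] >= threshold]
    return assign


def mixed_loss(loss_o2o, loss_o2m, weight=1.0):
    """Combine both supervision signals; the weighting is an assumption."""
    return loss_o2o + weight * loss_o2m


# Toy example: 4 queries (rows) scored against 2 ground-truth objects (columns).
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
    [0.3, 0.6],
]
o2o = one_to_one_assign(scores)
o2m = one_to_many_assign(scores, k=2, threshold=0.5)
print(o2o)  # one query per object
print(o2m)  # several queries per object
```

In this toy example the one-to-one assignment gives a training signal to only two of the four queries, while the one-to-many assignment also supervises the two runner-up queries. Both assignments are computed over the same queries, so no extra decoder branch or query set is involved.
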
### Experimental Results

Experimental results show that MS-DETR achieves significant performance improvements on various DETR baseline models. For example, with 12 training epochs, MS-DETR improves mAP by 3.7, 3.7, and 1.8 on DAB-Deformable-DETR, Deformable DETR, and Deformable DETR++, respectively. MS-DETR can also be combined with existing DETR variants to further enhance performance.

### Conclusion

By introducing a mixed supervision mechanism, MS-DETR addresses the lack of direct supervision in DETR's candidate box generation process, improving both training efficiency and detection performance. The method performs well across various DETR baseline models and is efficient in computation and memory, which makes it a practical improvement to adopt.
