Abstract: In this paper, we observe that assigning too few queries as positive samples in DETR with one-to-one set matching leads to sparse supervision on the encoder's output, which considerably hurts the discriminative feature learning of the encoder, and vice versa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely $\mathcal{C}$o-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments, such as ATSS and Faster RCNN. In addition, we construct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. In inference, these auxiliary heads are discarded, and thus our method introduces no additional parameters or computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much smaller model sizes. Code is available at \url{https://github.com/Sense-X/Co-DETR}.
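To make the sparse-supervision observation concrete, the following is a minimal sketch, not the paper's implementation, that counts positive samples under the two assignment styles on a toy example. It rests on several assumptions: the matching cost is plain IoU (DETR's actual cost also includes classification and L1 terms), the one-to-one matching is solved by brute force instead of the Hungarian algorithm, and the one-to-many rule is a fixed IoU threshold standing in for ATSS or Faster RCNN assignment. The function names, boxes, and the threshold of 0.5 are all hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def one_to_one_positives(preds, gts):
    """One-to-one set matching (brute force, toy sizes only).

    Each ground-truth box is matched to exactly one distinct prediction
    so that the total IoU is maximal. Assumes len(preds) >= len(gts).
    """
    best, best_score = (), -1.0
    for perm in permutations(range(len(preds)), len(gts)):
        score = sum(iou(preds[p], gts[g]) for g, p in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return set(best)

def one_to_many_positives(preds, gts, thr=0.5):
    """One-to-many assignment (simplified): every prediction whose IoU
    with some ground-truth box reaches thr counts as a positive."""
    return {i for i, p in enumerate(preds)
            if any(iou(p, g) >= thr for g in gts)}

# Hypothetical toy data: two objects, six candidate boxes.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [
    (0, 0, 10, 10),    # exact hit on object 0
    (1, 1, 11, 11),    # slightly shifted copy of object 0
    (0, 0, 9, 10),     # slightly narrower copy of object 0
    (20, 20, 30, 30),  # exact hit on object 1
    (21, 20, 31, 30),  # slightly shifted copy of object 1
    (50, 50, 60, 60),  # background
]

o2o = one_to_one_positives(preds, gts)
o2m = one_to_many_positives(preds, gts)
print("one-to-one positives: ", sorted(o2o))
print("one-to-many positives:", sorted(o2m))
```

In this toy setup the one-to-one matching can never yield more positives than there are ground-truth boxes, while the threshold rule also marks the near-duplicate boxes as positive. That gap is the denser supervision signal the auxiliary heads are meant to supply during training.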

AParC-DETR: Accelerate DETR training by introducing Adaptive Position-aware Circular Convolution

PMG-DETR: fast convergence of DETR with position-sensitive multi-scale attention and grouped queries

Conditional DETR for Fast Training Convergence

Efficient DETR: Improving End-to-End Object Detector with Dense Prior

Conditional DETR V2: Efficient Detection Transformer with Box Queries

FP-DETR: Detection Transformer Advanced by Fully Pre-training

DETR++: Taming Your Multi-Scale Detection Transformer

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion

Cross Resolution Encoding-Decoding For Detection Transformers

SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency

L-DETR: A Light-Weight Detector for End-to-End Object Detection With Transformers

DHS-DETR: Efficient DETRs with Dynamic Head Switching

Accelerating DETR Convergence via Semantic-Aligned Matching

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

Towards Data-Efficient Detection Transformers

PR-Deformable DETR: DETR for Remote Sensing Object Detection

DETRs with Collaborative Hybrid Assignments Training

Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection

Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

End-to-End Object Detection with Adaptive Clustering Transformer