Abstract: In this paper, we observe that assigning too few queries as positive samples in DETR with one-to-one set matching leads to sparse supervision on the encoder's output, which considerably hurts the discriminative feature learning of the encoder, and vice versa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely $\mathcal{C}$o-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments such as ATSS and Faster RCNN. In addition, we generate extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. During inference, these auxiliary heads are discarded, so our method introduces no additional parameters or computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, combined with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much smaller model sizes. Code is available at \url{https://github.com/Sense-X/Co-DETR}.
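
The gap in positive-sample counts between one-to-one and one-to-many assignment, which motivates the abstract above, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes an IoU-only matching score (DETR's actual matching cost also includes classification and box-regression terms), a brute-force search as a stand-in for Hungarian matching, and a simplified IoU-threshold rule in the spirit of Faster RCNN. The function names, boxes, and the 0.5 threshold are all hypothetical.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def one_to_one_positives(queries, gts):
    """Each ground truth gets exactly one query (brute-force optimal matching).

    Assumption: stands in for Hungarian matching and is only practical for
    tiny inputs; requires len(queries) >= len(gts).
    """
    best, best_score = (), -1.0
    for perm in permutations(range(len(queries)), len(gts)):
        score = sum(iou(queries[q], gts[g]) for g, q in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return set(best)

def one_to_many_positives(queries, gts, thr=0.5):
    """Every query overlapping some ground truth by IoU >= thr is positive."""
    return {q for q, box in enumerate(queries)
            if any(iou(box, g) >= thr for g in gts)}

# Hypothetical toy data: two objects, six candidate boxes.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
queries = [
    (0, 0, 10, 10), (1, 1, 11, 11), (0, 0, 9, 10),   # near the first object
    (20, 20, 30, 30), (21, 20, 31, 30),              # near the second object
    (50, 50, 60, 60),                                # background
]

print(len(one_to_one_positives(queries, gts)))   # 2 positives
print(len(one_to_many_positives(queries, gts)))  # 5 positives
```

In this toy setting, one-to-one matching supervises only two of the six candidates as positives, while the one-to-many rule supervises five, which is the kind of denser signal the auxiliary heads are described as providing during training.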

TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Length-Aware DETR for Robust Moment Retrieval

MCT-VHD: Multi-modal contrastive transformer for video highlight detection

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

DETRs with Hybrid Matching

GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

Multi-Modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Background-aware Moment Detection for Video Moment Retrieval

MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction

Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems

BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

DETRs with Collaborative Hybrid Assignments Training

MV-DETR: Multi-modality indoor object detection by Multi-View DEtecton TRansformers

Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding

End-to-End Video Text Spotting with Transformer