Abstract: Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of current VOD, removing the need for many hand-crafted feature-aggregation components such as optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current-frame detection results. These designs boost the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
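
The abstract does not spell out how the Temporal Query Encoder fuses object queries across frames. As an illustration only, the sketch below assumes plain scaled dot-product cross-attention with a residual connection and no learned projections; the function names `fuse_queries`, `softmax`, and `dot` are hypothetical and are not taken from the TransVOD repository.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    # Inner product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))


def fuse_queries(current, references):
    """Illustrative temporal query fusion (an assumption, not the paper's TQE).

    Each current-frame query attends over all reference-frame queries with
    scaled dot-product attention; the attended mixture is added back to the
    query as a residual. A real model would use learned projections.
    """
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in current:
        weights = softmax([dot(q, r) * scale for r in references])
        attended = [
            sum(w * r[i] for w, r in zip(weights, references))
            for i in range(dim)
        ]
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Two current-frame queries, three queries pooled from reference frames.
current = [[1.0, 0.0], [0.0, 1.0]]
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_queries(current, references)
```

The output keeps one fused vector per current-frame query, which matches the abstract's description of producing detections for the current frame only; everything else about the sketch is a simplification.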

MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer

TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

MCT-VHD: Multi-modal contrastive transformer for video highlight detection

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

MV-DETR: Multi-modality indoor object detection by Multi-View DEtecton TRansformers

Saliency-Guided DETR for Moment Retrieval and Highlight Detection

Background-aware Moment Detection for Video Moment Retrieval

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

Length-Aware DETR for Robust Moment Retrieval

GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction

A Multimodal Transformer for Live Streaming Highlight Prediction

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems

LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

Multi-Modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection