Abstract: Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance on par with previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of current VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. Moreover, thanks to the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current-frame detection results. These designs boost the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state of the art in accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed-accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
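
As a rough illustration of what "fusing object queries" across frames can look like, the sketch below lets each current-frame query attend to the queries of reference frames and adds the attended result back as a residual. This is not the authors' Temporal Query Encoder: it assumes plain single-head dot-product attention with no learned projections, and the names `fuse_queries`, `current_queries`, and `reference_queries` are hypothetical. The linked repository holds the actual implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current, reference):
    """Toy cross-attention (an assumption, not the paper's TQE).

    Each current-frame query attends to all reference-frame queries;
    the attention-weighted average is added back as a residual.
    """
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in current:
        # Scaled dot-product similarity between q and every reference query.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in reference]
        weights = softmax(scores)
        # Weighted average of the reference queries, one dimension at a time.
        attended = [
            sum(w * k[d] for w, k in zip(weights, reference))
            for d in range(dim)
        ]
        fused.append([qd + ad for qd, ad in zip(q, attended)])
    return fused


# Hypothetical 2-D queries: two from the current frame, three from reference frames.
current_queries = [[1.0, 0.0], [0.0, 1.0]]
reference_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_queries = fuse_queries(current_queries, reference_queries)
```

The output keeps one fused vector per current-frame query, so a downstream decoder would still see the same number of queries as a single-frame detector.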

Prompt-Guided DETR with RoI-pruned masked attention for open-vocabulary object detection

Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection

Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model

OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

DETR-ORD: An Improved DETR Detector for Oriented Remote Sensing Object Detection with Feature Reconstruction and Dynamic Query

End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

FP-DETR: Detection Transformer Advanced by Fully Pre-training

Visual Modality Prompt for Adapting Vision-Language Object Detectors

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation