Abstract: Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance comparable to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of current VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. Besides, benefiting from the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current frame detection results. These designs boost the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. Then, we present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiment part. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single V100 GPU device.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.

T2D: Spatiotemporal Feature Learning Based on Triple 2D Decomposition

Learning Spatiotemporal Relationships with a Unified Framework for Video Object Segmentation

TDViT: Temporal Dilated Video Transformer for Dense Video Tasks

DeViT: Decomposing Vision Transformers for Collaborative Inference in Edge Devices

Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression

Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

TPC-ViT: Token Propagation Controller for Efficient Vision Transformer

Group Multi-View Transformer for 3D Shape Analysis with Spatial Encoding

LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference

VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation

TACD: A Novel 3-D Swin Transformer With Enhanced Feature Aggregation for Change Detection in Image Time Series

DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs

Denoising Vision Transformers

A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Introducing Depth into Transformer-based 3D Object Detection