Abstract: Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance as good as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of current VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. Besides, benefiting from the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current frame detection results. These designs boost the strong baseline Deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. Then, we present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiment part. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single V100 GPU device.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
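
To make the idea of fusing object queries across frames more concrete, the following is a minimal sketch of attention-based query fusion. It is not the authors' TQE implementation: the function name `fuse_queries`, the tensor shapes, the absence of learned projections, and the single-head residual attention are all assumptions made for illustration, and NumPy is used in place of a deep learning framework.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_queries(current_queries, reference_queries):
    """Fuse current-frame object queries with reference-frame queries.

    Hypothetical helper, not taken from the TransVOD code base.
    current_queries:   array of shape (N, D), N queries of dimension D.
    reference_queries: array of shape (T, N, D), queries from T reference frames.
    Returns an array of shape (N, D).
    """
    t, n, d = reference_queries.shape
    # Flatten all reference-frame queries into one set of keys/values.
    keys = reference_queries.reshape(t * n, d)
    # Scaled dot-product attention scores, shape (N, T * N).
    scores = current_queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Residual connection: keep the current query and add temporal context.
    return current_queries + weights @ keys


rng = np.random.default_rng(0)
current = rng.normal(size=(5, 16))   # 5 queries for the current frame
refs = rng.normal(size=(3, 5, 16))   # 5 queries for each of 3 reference frames
fused = fuse_queries(current, refs)
print(fused.shape)
```

In this sketch each current-frame query attends to every query from the reference frames, so the output keeps one fused vector per current-frame query. A real model would add learned query, key, and value projections, multiple heads, and normalization layers.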

TSDNN: tube sorting with deep neural networks for surveillance video synopsis

DTVNet: Dynamic Time-Lapse Video Generation via Single Still Image

Scene Adaptive Online Surveillance Video Synopsis via Dynamic Tube Rearrangement Using Octree

Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation

Finding Action Tubes with a Sparse-to-Dense Framework

Surveillance video synopsis framework base on tube set

Deformable Tube Network for Action Detection in Videos

Object Detection from Video Tubelets with Convolutional Neural Networks

TrackNet: Simultaneous Object Detection and Tracking and Its Application in Traffic Video Analysis

Sparse Action Tube Detection

Object Detection in Videos by High Quality Object Linking

Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos

Delving into Details: Synopsis-to-Detail Networks for Video Recognition

Background subtraction for video sequence using deep neural network

Deep Learning and Hybrid Approaches for Dynamic Scene Analysis, Object Detection and Motion Tracking

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Video Anomaly Detection using Pre-Trained Deep Convolutional Neural Nets and Context Mining

TEDdet: Temporal Feature Exchange and Difference Network for Online Real-Time Action Detection

Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification

TubeR: Tubelet Transformer for Video Action Detection

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers