Abstract: Recent progress in video object detection (VOD) has shown that aggregating features from other frames to capture long-range contextual information is important for handling challenges such as partial occlusion and motion blur. To make feature aggregation more effective, we propose three improvements over previous work: (1) a class-aware pixel-level feature aggregation module, which characterizes a pixel by exploiting the context information in instances from both the current frame and other frames. Unlike the previous non-local operation, the proposed module filters out most of the noisy information from the large background area and from objects of other classes, and enhances the representation of a foreground pixel only with instances of the same class, which limits ambiguous information; (2) a class-aware instance-level feature aggregation module, which aggregates features for object proposals by learning two kinds of relations: the temporal dependencies among same-class object proposals from support frames sampled over a long time range or even the whole sequence, and the spatial topology relations among proposals of different objects in the target frame. The homogeneity constraint in instance-level feature aggregation filters out many defective proposals, making the aggregation more accurate; and (3) a correlation-based feature alignment module embedded in the instance-level feature aggregation, which aligns the feature maps of the support and target proposals. Without bells and whistles or any post-processing, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset. The project is publicly available at https://github.com/LiangHann/Class-aware-Feature-Aggregation-Network-for-Video-Object-Detection
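To make the idea in (1) concrete, the following is a minimal NumPy sketch of a non-local style aggregation restricted by class. It is not the authors' implementation: the function name `class_aware_aggregate`, the per-pixel class labels, the scaled dot-product affinity, and the residual update are all assumptions chosen for illustration. It only shows how masking the affinity matrix leaves background pixels untouched and lets a foreground pixel attend solely to same-class pixels.

```python
import numpy as np


def class_aware_aggregate(target_feats, target_cls, support_feats, support_cls,
                          background=0):
    """Enhance foreground target pixels with same-class support pixels.

    target_feats:  (N, C) pixel features of the current frame (flattened H*W).
    target_cls:    (N,) integer class label per target pixel (assumed given).
    support_feats: (M, C) pixel features from the current and other frames.
    support_cls:   (M,) integer class label per support pixel (assumed given).
    background:    label treated as background (hypothetical convention).
    """
    # Scaled dot-product affinity between every target and support pixel.
    scale = np.sqrt(target_feats.shape[1])
    logits = target_feats @ support_feats.T / scale          # (N, M)

    # Class-aware mask: a pair is kept only if the target pixel is
    # foreground and both pixels carry the same class label.
    same_class = target_cls[:, None] == support_cls[None, :]
    foreground = (target_cls != background)[:, None]
    mask = same_class & foreground                           # (N, M)

    # Masked softmax. Rows with no valid partner get all-zero weights.
    logits = np.where(mask, logits, -np.inf)
    has_match = mask.any(axis=1, keepdims=True)
    logits = np.where(has_match, logits, 0.0)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits) * mask
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights),
                        where=denom > 0)

    # Residual update: unmatched (e.g. background) pixels stay unchanged.
    return target_feats + weights @ support_feats
```

In this sketch the mask plays the role of the class-aware filtering described in the abstract; a plain non-local operation would correspond to a mask that is true everywhere.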

DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Temporal-adaptive sparse feature aggregation for video object detection

Adaptive Feature Aggregation for Video Object Detection

Guided Sampling Based Feature Aggregation for Video Object Detection

Object Detection in Video with Spatial-temporal Context Aggregation

Non-dense Feature Aggregation for Video Object Detection

Relation-Guided Multi-stage Feature Aggregation Network for Video Object Detection

Video Object Detection by Aggregating Features Across Adjacent Frames

Exploiting Better Feature Aggregation for Video Object Detection

Dualfeat: Dual Feature Aggregation for Video Object Detection

Fianet: Video Object Detection Via Joint Feature-Level and Instance-Level Aggregation

Practical Video Object Detection via Feature Selection and Aggregation

Real-Time and Accurate Object Detection in Compressed Video by Long Short-term Feature Aggregation

Learning intra-inter semantic aggregation for video object detection

Video object detection via space–time feature aggregation and result reuse

Spatial-Temporal Feature Aggregation Network for Video Object Detection

Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection

Multi-Focus Guided Semantic Aggregation for Video Object Detection

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

Class-aware Feature Aggregation Network for Video Object Detection