Abstract: Compared with object detection in still images, video object detection (VOD) must contend with high across-frame variation in object appearance and with diverse deterioration in some frames. In principle, detection in a given frame of a video can benefit from information in other frames, so effectively aggregating features across frames is key to the problem. Most contemporary aggregation methods are tailored to two-stage detectors and suffer from high computational cost due to their dual-stage nature. One-stage detectors, on the other hand, have made continuous progress on static images, but their applicability to VOD remains insufficiently explored. To tackle these issues, this study proposes a very simple yet potent strategy of feature selection and aggregation that gains significant accuracy at marginal computational expense. Concretely, to cut the massive computation and memory consumption caused by the dense prediction characteristic of one-stage object detectors, we first condense candidate features from the dense prediction maps. Then, the relationship between a target frame and its reference frames is evaluated to guide the aggregation. Comprehensive experiments and ablation studies validate the efficacy of our design and showcase its advantage over other cutting-edge VOD methods in both effectiveness and efficiency. Notably, our model reaches \emph{a new record performance, i.e., 92.9\% AP50 at over 30 FPS on the ImageNet VID dataset on a single 3090 GPU}, making it a compelling option for large-scale or real-time applications. The implementation is simple and accessible at \url{https://github.com/YuHengsss/YOLOV}.
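The two steps named in the abstract, condensing candidates from dense prediction maps and then aggregating across frames under the guidance of a target-to-reference relationship, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names `select_top_k` and `aggregate`, the top-k selection rule, the cosine-similarity softmax weighting, and all shapes are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def select_top_k(features, scores, k):
    """Keep the k highest-scoring candidates from one frame's dense predictions.

    Assumed selection rule (hypothetical): rank by a per-prediction confidence.
    """
    idx = np.argsort(scores)[::-1][:k]
    return features[idx], scores[idx]


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate(target_feats, reference_feats):
    """Similarity-weighted average of reference features for each target feature.

    Assumed relationship measure (hypothetical): cosine similarity + softmax.
    """
    t = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + 1e-8)
    r = reference_feats / (np.linalg.norm(reference_feats, axis=1, keepdims=True) + 1e-8)
    weights = softmax(t @ r.T, axis=1)  # shape (k_target, k_reference), rows sum to 1
    return weights @ reference_feats


# Toy data standing in for dense one-stage detector outputs (all sizes are assumptions).
rng = np.random.default_rng(0)
num_frames, num_preds, dim, k = 3, 100, 16, 5
dense_feats = rng.normal(size=(num_frames, num_preds, dim))
dense_scores = rng.uniform(size=(num_frames, num_preds))

# Step 1: condense each frame's dense predictions to k candidates.
selected = [select_top_k(dense_feats[i], dense_scores[i], k)[0] for i in range(num_frames)]

# Step 2: treat frame 0 as the target and the rest as references, then aggregate.
target = selected[0]
references = np.concatenate(selected[1:], axis=0)
refined = aggregate(target, references)
print(refined.shape)  # (5, 16)
```

Selecting before aggregating keeps the pairwise similarity matrix at size `k × (k · number of reference frames)` rather than scaling with the full number of dense predictions, which is the cost saving the abstract attributes to the selection step.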

Spatial-Temporal Feature Aggregation Network for Video Object Detection

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Adaptive Feature Aggregation for Video Object Detection

Temporal-adaptive sparse feature aggregation for video object detection

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

SSGA-Net: Stepwise Spatial Global-local Aggregation Networks for Autonomous Driving

DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Real-Time and Accurate Object Detection in Compressed Video by Long Short-term Feature Aggregation

Multi-view Aggregation for Real-Time Accurate Object Detection of a Moving Camera

Fianet: Video Object Detection Via Joint Feature-Level and Instance-Level Aggregation

Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection

DGRNet: A Dual-Level Graph Relation Network for Video Object Detection

Practical Video Object Detection via Feature Selection and Aggregation

Hierarchical Feature Aggregation Networks for Video Action Recognition

Spatiotemporal tubelet feature aggregation and object linking for small object detection in videos

A Two-Branch Network for Video Anomaly Detection with Spatio-Temporal Feature Learning

Fully Motion-Aware Network for Video Object Detection

Video object detection via space–time feature aggregation and result reuse

Rethinking feature aggregation for deep RGB-D salient object detection