Abstract: Compared with object detection in still images, video object detection (VOD) must contend with high cross-frame variation in object appearance and with diverse degradation in some frames. In principle, detection in a given frame of a video can benefit from information in other frames. Thus, how to effectively aggregate features across frames is key to the problem. Most contemporary aggregation methods are tailored for two-stage detectors and suffer from high computational costs due to their dual-stage nature. On the other hand, although one-stage detectors have made continuous progress on static images, their applicability to VOD remains underexplored. To tackle these issues, this study devises a simple yet potent strategy of feature selection and aggregation that gains significant accuracy at marginal computational expense. Concretely, to cut the massive computation and memory consumption caused by the dense predictions of one-stage object detectors, we first condense candidate features from the dense prediction maps. Then, the relationship between a target frame and its reference frames is evaluated to guide the aggregation. Comprehensive experiments and ablation studies validate the efficacy of our design and showcase its advantage over other cutting-edge VOD methods in both effectiveness and efficiency. Notably, our model reaches \emph{a new record performance, i.e., 92.9\% AP50 at over 30 FPS on the ImageNet VID dataset on a single 3090 GPU}, making it a compelling option for large-scale or real-time applications. The implementation is simple and accessible at \url{https://github.com/YuHengsss/YOLOV}.
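
The two steps named in the abstract (condensing candidates from dense predictions, then aggregating them according to the target-reference relationship) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `select_top_k` and `aggregate`, the use of confidence scores for selection, cosine similarity as the relationship measure, and the `temperature` parameter are all assumptions made for illustration only.

```python
import math


def select_top_k(features, scores, k):
    """Keep the k candidate features with the highest confidence scores.

    Assumption: a per-candidate confidence score is the selection criterion.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [features[i] for i in order]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(target, references, temperature=0.1):
    """Replace each target feature with a similarity-weighted mean of references.

    Assumption: cosine similarity stands in for the "relationship" between
    target-frame and reference-frame features; `temperature` is hypothetical.
    """
    aggregated = []
    for t in target:
        weights = softmax([cosine(t, r) / temperature for r in references])
        aggregated.append(
            [sum(w * r[d] for w, r in zip(weights, references)) for d in range(len(t))]
        )
    return aggregated


# Toy usage: three dense candidates in a reference frame, keep the best two,
# then aggregate them into a single target-frame feature.
ref_candidates = select_top_k([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.2, 0.9, 0.5], k=2)
fused = aggregate([[0.0, 1.0]], ref_candidates)
```

Selecting a small candidate set before computing pairwise similarities keeps the cost proportional to the number of kept candidates rather than to the full dense prediction map, which is the motivation the abstract gives for condensing first.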

Fianet: Video Object Detection Via Joint Feature-Level and Instance-Level Aggregation

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Real-Time and Accurate Object Detection in Compressed Video by Long Short-term Feature Aggregation

Adaptive Feature Aggregation for Video Object Detection

Temporal-adaptive sparse feature aggregation for video object detection

Spatial-Temporal Feature Aggregation Network for Video Object Detection

Fully Motion-Aware Network for Video Object Detection

Feature Agglomeration Networks for Single Stage Face Detection

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

Video object detection via space–time feature aggregation and result reuse

Feature Aligned Recurrent Network For Causal Video Object Detection

Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection

Impression Network for Video Object Detection

ASFD: Automatic and Scalable Face Detector

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Practical Video Object Detection via Feature Selection and Aggregation

FFAVOD: Feature fusion architecture for video object detection

Spatiotemporal tubelet feature aggregation and object linking for small object detection in videos

DGRNet: A Dual-Level Graph Relation Network for Video Object Detection

Spatial Feature Calibration and Temporal Fusion for Effective One-stage Video Instance Segmentation