Spatial-Temporal Feature Aggregation Network for Video Object Detection

Zhu Chen, Weihai Li, Chi Fei, Bin Liu, Nenghai Yu
DOI: https://doi.org/10.1109/icassp40776.2020.9054080
2020-01-01
Abstract: Video object detection is a challenging problem in computer vision. In this paper, we propose a novel spatial-temporal feature aggregation network to address it. Specifically, we present an instance-level feature aggregation module that complements traditional pixel-level feature aggregation; within it, a new movement estimation module learns instance movements across frames. Graph Convolutional Networks (GCNs) are then applied to capture temporal relations among instances across frames and so implement instance-level feature aggregation. Finally, we combine pixel-level and instance-level features through learnable soft weights to exploit their complementary information. Our framework is simple to implement and supports end-to-end training, and extensive experiments show that it achieves state-of-the-art performance on the ImageNet VID dataset.
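The abstract does not give the formulation, so the following is only a minimal sketch of two ideas it names: aggregating instance features over a graph, and blending pixel-level and instance-level features with soft weights. Everything here is an assumption for illustration: the function names `aggregate_instances` and `fuse_features`, the row-normalized neighbor averaging (a real GCN layer would also apply a learned weight matrix and a nonlinearity), and the use of a softmax over two logits to produce the soft weights. In a trained network the logits would be learned parameters; here they are plain numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_instances(features, adjacency):
    """One row-normalized graph aggregation step (hypothetical helper).

    features:  list of N feature vectors, one per instance.
    adjacency: N x N non-negative weights; each row is assumed to have a
               non-zero sum (e.g. because self-loops are included).
    Each output vector is the weighted mean of the neighbors' features.
    A learned weight matrix and nonlinearity are deliberately omitted.
    """
    dim = len(features[0])
    out = []
    for row in adjacency:
        total = sum(row)
        out.append([
            sum(w * feat[d] for w, feat in zip(row, features)) / total
            for d in range(dim)
        ])
    return out


def fuse_features(pixel_feat, instance_feat, logits):
    """Blend two equal-length feature vectors with soft weights.

    logits: two numbers standing in for learnable parameters (assumption);
            softmax turns them into weights that sum to 1.
    """
    if len(pixel_feat) != len(instance_feat):
        raise ValueError("feature vectors must have the same length")
    w_pixel, w_instance = softmax(logits)
    return [w_pixel * p + w_instance * q
            for p, q in zip(pixel_feat, instance_feat)]
```

As a usage illustration, `fuse_features(pixel_vec, instance_vec, [0.0, 0.0])` weights both sources equally, while raising the second logit shifts the blend toward the instance-level features.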