Relation-Guided Multi-stage Feature Aggregation Network for Video Object Detection.

Tingting Yao, Fuxiao Cao, Fuheng Mi, Danmeng Li
DOI: https://doi.org/10.1007/978-981-99-8537-1_12
2024-01-01
Abstract: Video object detection has received extensive research attention, and various methods have been proposed. The quality of individual frames in a video is often degraded by motion blur and object occlusion, which causes detection to fail. Although some methods attempt to enhance the feature representation of each frame by aggregating temporal context from other frames, existing methods are usually sensitive to changes in object appearance and scale, which leads to false or missed detections. In this paper, we therefore propose a Relation-guided Multi-stage Feature Aggregation (RMFA) network for video object detection. First, a Multi-Stage Feature Aggregation (MSFA) framework is devised to aggregate the feature representations of global and local support frames in each stage, so that both global semantic information and local motion information are better captured. Furthermore, a Multi-sources Feature Aggregation (MFA) module is proposed to enhance the quality of the support frames and thereby improve the feature representation of the current frame. Finally, a Temporal Relation-Guided (TRG) module is proposed to improve feature aggregation by supervising the semantic similarity relationships between different object proposals, which enhances the model's ability to selectively store valuable features. Qualitative and quantitative experiments on the ImageNet VID dataset demonstrate that our model achieves superior video object detection results compared with a number of state-of-the-art methods. In particular, the model performs especially well when objects are occluded or in fast motion.
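As a rough illustration of the general idea behind similarity-guided feature aggregation, the sketch below weights support-frame proposal features by their cosine similarity to a current-frame proposal and adds the weighted context back to the current feature. This is not the RMFA, MSFA, MFA, or TRG design, whose details the abstract does not give; the function names, the softmax weighting, the `temperature` parameter, and the residual addition are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(current, supports, temperature=1.0):
    """Enhance `current` with a similarity-weighted sum of `supports`.

    Hypothetical helper: the weighting scheme and residual addition are
    assumptions for this sketch, not the paper's method.
    """
    weights = softmax([cosine(current, s) / temperature for s in supports])
    context = [
        sum(w * s[i] for w, s in zip(weights, supports))
        for i in range(len(current))
    ]
    enhanced = [c + x for c, x in zip(current, context)]
    return enhanced, weights


# Toy example: the first support proposal matches the current one exactly,
# so it receives the larger aggregation weight.
current = [1.0, 0.0]
supports = [[1.0, 0.0], [0.0, 1.0]]
enhanced, weights = aggregate(current, supports)
```

In this toy setup, a support proposal that resembles the current proposal contributes more to the enhanced feature than a dissimilar one, which is the intuition behind aggregating selectively rather than averaging all support frames equally.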