Cyclic Refiner: Object-Aware Temporal Representation Learning for Multi-view 3D Detection and Tracking

Mingzhe Guo, Zhipeng Zhang, Liping Jing, Yuan He, Ke Wang, Heng Fan
DOI: https://doi.org/10.1007/s11263-024-02176-7
IF: 13.369
2024-07-16
International Journal of Computer Vision
Abstract: We propose a unified object-aware temporal learning framework for multi-view 3D detection and tracking tasks. Having observed that the efficacy of the temporal fusion strategy in recent multi-view perception methods may be weakened by distractors and background clutter in historical frames, we propose a cyclic learning mechanism to improve the robustness of multi-view representation learning. The essence is constructing a backward bridge that propagates information from model predictions (e.g., object locations and sizes) to image and BEV features, which closes a cycle with the regular forward inference. After backward refinement, the responses of target-irrelevant regions in historical frames are suppressed, decreasing the risk of polluting future frames and improving the object awareness of temporal fusion. We further tailor an object-aware association strategy for tracking based on the cyclic learning model. The cyclic learning model not only provides refined features, but also delivers finer clues (e.g., scale level) for tracklet association. Together, the proposed cyclic learning method and association module form a novel, unified multi-task framework. Experiments on nuScenes show that the proposed model achieves consistent performance gains over baselines of different designs (i.e., dense query-based BEVFormer, sparse query-based SparseBEV and LSS-based BEVDet4D) on both detection and tracking evaluation. Code and models will be released.
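To make the backward-refinement idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single-channel BEV feature map stored as nested lists, axis-aligned predicted boxes given as (cx, cy, width, length) in grid cells, and a fixed hard mask with a hypothetical `background_scale` factor; the abstract does not specify how the paper's refinement is actually computed. The names `object_mask` and `refine_bev` are illustrative only.

```python
from typing import List, Tuple

# Assumed box format: (cx, cy, width, length) in BEV grid cells, axis-aligned.
Box = Tuple[float, float, float, float]
Grid = List[List[float]]


def object_mask(height: int, width: int, boxes: List[Box]) -> Grid:
    """Return a height x width mask: 1.0 inside any predicted box, 0.0 elsewhere."""
    mask = [[0.0] * width for _ in range(height)]
    for cx, cy, w, l in boxes:
        x0, x1 = cx - w / 2, cx + w / 2
        y0, y1 = cy - l / 2, cy + l / 2
        for y in range(height):
            for x in range(width):
                # Test the cell centre (x + 0.5, y + 0.5) against the box extent.
                if x0 <= x + 0.5 <= x1 and y0 <= y + 0.5 <= y1:
                    mask[y][x] = 1.0
    return mask


def refine_bev(bev: Grid, boxes: List[Box], background_scale: float = 0.1) -> Grid:
    """Scale down responses outside predicted boxes; keep responses inside unchanged."""
    h, w = len(bev), len(bev[0])
    mask = object_mask(h, w, boxes)
    return [
        [bev[y][x] * (mask[y][x] + (1.0 - mask[y][x]) * background_scale)
         for x in range(w)]
        for y in range(h)
    ]


# Toy 4x4 BEV map with one predicted box covering the central 2x2 cells.
bev = [[1.0] * 4 for _ in range(4)]
refined = refine_bev(bev, [(2.0, 2.0, 2.0, 2.0)])
```

In this toy setup the refined historical map, rather than the raw one, would be what gets fused into the next frame, so background responses carry less weight in temporal fusion.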
computer science, artificial intelligence