OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection

Zhangyang Qi, Jiaqi Wang, Xiaoyang Wu, Hengshuang Zhao
2023-06-03
Abstract: Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost. Most current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm, which benefits from both BEV's strong perception power and an end-to-end pipeline. Despite substantial progress, existing works model objects by leveraging the temporal and spatial information of BEV features globally, which causes problems in complex and dynamic autonomous driving scenarios. In this paper, we propose an Object-Centric query-BEV detector, OCBEV, which captures the temporal and spatial cues of moving targets more effectively. OCBEV comprises three designs: Object Aligned Temporal Fusion aligns the BEV feature based on ego-motion and the estimated current locations of moving objects, leading to precise instance-level feature fusion. Object Focused Multi-View Sampling samples more 3D features from adaptive local height ranges of objects for each scene to enrich foreground information. Object Informed Query Enhancement replaces part of the pre-defined decoder queries in common DETR-style decoders with positional features of objects at high-confidence locations, introducing more direct object positional priors. Extensive experimental evaluations are conducted on the challenging nuScenes dataset. Our approach achieves a state-of-the-art result, surpassing the traditional BEVFormer by 1.5 NDS points. Moreover, it converges faster and needs only half of the training iterations to reach comparable performance, which further demonstrates its effectiveness.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenges that multi-view 3D object detection faces in complex, dynamic autonomous driving scenes. Specifically, existing query-based Bird's Eye View (BEV) detectors, while performing well in static scenes, encounter the following issues when handling moving objects:

1. **Insufficient Temporal Modeling**: Existing methods typically use temporal and spatial information globally, so they fail to capture the temporal changes of moving objects effectively.
2. **Inaccurate Spatial Sampling**: Current spatial sampling methods sample uniformly across the global height range, ignoring the fact that most moving objects are concentrated within a local height range.
3. **Unreasonable Query Design**: Predefined queries are difficult to match with objects during optimization, especially in sparse scenes.

To overcome these issues, the authors propose an Object-Centric Query-BEV Detector (OCBEV), which improves on existing methods through three modules:

1. **Object Aligned Temporal Fusion**: Aligns historical BEV features with current BEV features by accounting for both ego-vehicle motion and object motion, achieving precise instance-level feature fusion.
2. **Object Focused Multi-View Sampling**: Predicts an adaptive local height range for each scene and densely samples 3D features within this range to enrich foreground information.
3. **Object Informed Query Enhancement**: Replaces part of the predefined decoder queries with object position features from high-confidence locations, introducing more direct object position priors.

With these improvements, experiments on the nuScenes dataset show that OCBEV achieves a state-of-the-art result, surpassing BEVFormer by 1.5 NDS points. OCBEV also converges faster, reaching comparable performance with only half the training iterations.
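The idea behind Object Informed Query Enhancement (swapping some predefined decoder queries for positional features taken at high-confidence locations) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names `topk_locations`, `positional_feature`, and `enhance_queries`, the sinusoidal encoding, the `threshold` parameter, and the choice to overwrite the first query slots are all assumptions made for the example. The source does not specify how positions are encoded or which queries are replaced.

```python
import math


def topk_locations(heatmap, k):
    """Return the (row, col) cells with the k highest confidence scores."""
    cells = [
        (score, r, c)
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
    ]
    # Stable sort by descending score; ties keep row-major order.
    cells.sort(key=lambda t: -t[0])
    return [(r, c) for _, r, c in cells[:k]]


def positional_feature(r, c, height, width, dim):
    """Hypothetical sinusoidal encoding of a normalized BEV grid position."""
    if dim % 4 != 0:
        raise ValueError("dim must be a multiple of 4 for this encoding")
    y, x = r / height, c / width
    feat = []
    for i in range(dim // 4):
        freq = (2.0 ** i) * math.pi
        feat += [math.sin(freq * x), math.cos(freq * x),
                 math.sin(freq * y), math.cos(freq * y)]
    return feat


def enhance_queries(queries, heatmap, k, threshold=0.0):
    """Replace up to k predefined queries with object positional features.

    queries: list of equal-length float lists (stand-ins for learned queries).
    heatmap: 2D list of per-cell object confidence on the BEV grid.
    Returns the new query list and the BEV cells that were used.
    """
    dim = len(queries[0])
    height, width = len(heatmap), len(heatmap[0])
    k = min(k, len(queries))
    picked = [
        (r, c) for r, c in topk_locations(heatmap, k)
        if heatmap[r][c] >= threshold
    ]
    enhanced = [list(q) for q in queries]  # copy; the input is left untouched
    for slot, (r, c) in enumerate(picked):
        # Assumption: overwrite the first slots; the rest stay predefined.
        enhanced[slot] = positional_feature(r, c, height, width, dim)
    return enhanced, picked


if __name__ == "__main__":
    heatmap = [
        [0.10, 0.90, 0.00],
        [0.20, 0.05, 0.80],
        [0.00, 0.30, 0.10],
    ]
    queries = [[0.0] * 4 for _ in range(5)]
    enhanced, picked = enhance_queries(queries, heatmap, k=2)
    print(picked)  # the two most confident cells: [(0, 1), (1, 2)]
```

In a real detector the queries would be learned embeddings and the heatmap would come from a prediction head on the BEV features; the sketch only shows the selection-and-replacement step, with the remaining queries kept as predefined ones.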