Abstract: Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost. Most current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm, which benefits from both BEV's strong perception power and an end-to-end pipeline. Despite substantial progress, existing works model objects by leveraging the temporal and spatial information of BEV features globally, which causes problems in complex and dynamic autonomous driving scenarios. In this paper, we propose an Object-Centric query-BEV detector, OCBEV, which captures the temporal and spatial cues of moving targets more effectively. OCBEV comprises three designs: Object Aligned Temporal Fusion aligns the BEV feature based on ego-motion and the estimated current locations of moving objects, leading to precise instance-level feature fusion. Object Focused Multi-View Sampling samples more 3D features from an adaptive local height range of objects for each scene to enrich foreground information. Object Informed Query Enhancement replaces part of the pre-defined decoder queries in common DETR-style decoders with positional features of objects at high-confidence locations, introducing more direct object positional priors. Extensive experimental evaluations are conducted on the challenging nuScenes dataset. Our approach achieves a state-of-the-art result, surpassing the traditional BEVFormer by 1.5 NDS points. Moreover, it converges faster and needs only half of the training iterations to reach comparable performance, which further demonstrates its effectiveness.
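The alignment idea behind Object Aligned Temporal Fusion can be illustrated with plain 2D geometry. The sketch below is not the paper's implementation: it assumes a constant-velocity object model, a rigid 2D ego motion (translation plus yaw), and a hypothetical helper name `align_to_current_frame`. It only shows how a location from the previous BEV frame might be moved to where the object is expected to be in the current ego frame before features are fused.

```python
import math


def align_to_current_frame(point, velocity, dt, ego_translation, ego_yaw):
    """Map a BEV location from the previous ego frame into the current one.

    All names and conventions here are illustrative assumptions:
    point, velocity  -- (x, y) in the previous ego frame, in m and m/s
    dt               -- time between the two frames, in seconds
    ego_translation  -- (x, y) of the current ego origin, expressed in the
                        previous ego frame
    ego_yaw          -- counter-clockwise rotation of the current ego frame
                        relative to the previous one, in radians
    """
    # Step 1 (object motion): advance the object with a constant-velocity model.
    px = point[0] + velocity[0] * dt
    py = point[1] + velocity[1] * dt

    # Step 2 (ego motion): express the advanced point in the current ego frame
    # by subtracting the ego translation and rotating by -ego_yaw.
    dx = px - ego_translation[0]
    dy = py - ego_translation[1]
    c, s = math.cos(ego_yaw), math.sin(ego_yaw)
    return (c * dx + s * dy, -s * dx + c * dy)


# A static object 10 m ahead, after the ego vehicle drove 2 m forward.
static_obj = align_to_current_frame((10.0, 0.0), (0.0, 0.0), 0.5, (2.0, 0.0), 0.0)

# The same object moving at 4 m/s keeps its distance to the ego vehicle.
moving_obj = align_to_current_frame((10.0, 0.0), (4.0, 0.0), 0.5, (2.0, 0.0), 0.0)
```

Aligning with ego motion alone would place the moving object at the same spot as the static one, which is the kind of misalignment that instance-level fusion is meant to avoid.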

What problem does this paper attempt to address?

The paper addresses the difficulty that multi-view 3D object detection has with complex, dynamic autonomous driving scenes. Existing query-based bird's-eye-view (BEV) detectors use the temporal and spatial information of BEV features globally, which leads to the following issues when handling moving objects:

1. **Insufficient Temporal Modeling**: Temporal information is fused globally, so the motion of individual objects is not captured effectively.
2. **Inaccurate Spatial Sampling**: Current spatial sampling methods sample uniformly across the global height range, ignoring the fact that most moving objects are concentrated within a local height range.
3. **Ill-Suited Query Design**: Predefined queries are difficult to match with objects during optimization, especially in sparse scenes.

To overcome these issues, the authors propose an Object-Centric query-BEV detector (OCBEV), which improves on existing methods through three modules:

1. **Object Aligned Temporal Fusion**: Historical BEV features are aligned with current BEV features using both ego-vehicle motion and object motion, giving precise instance-level feature fusion.
2. **Object Focused Multi-View Sampling**: An adaptive local height range is predicted for each scene, and 3D features are sampled densely within this range to enrich foreground information.
3. **Object Informed Query Enhancement**: Part of the predefined decoder queries is replaced with positional features of objects at high-confidence locations, introducing more direct object positional priors.

On the nuScenes dataset, OCBEV achieves a state-of-the-art result, surpassing BEVFormer by 1.5 NDS points. It also converges faster, reaching comparable performance with only half the training iterations.
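The query replacement in Object Informed Query Enhancement can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`top_k_locations`, `positional_feature`, `enhance_queries`), the sinusoidal encoding, and the choice to overwrite the first queries rather than some other subset are all hypothetical. The summary above only states that part of the predefined queries is replaced with positional features from high-confidence locations.

```python
import math


def top_k_locations(heatmap, k):
    """Return the (row, col) cells with the k highest confidence scores."""
    cells = [(score, r, c)
             for r, row in enumerate(heatmap)
             for c, score in enumerate(row)]
    # Highest score first; ties are broken by position for determinism.
    cells.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in cells[:k]]


def positional_feature(row, col, height, width, dim):
    """Assumed sinusoidal encoding of a normalized BEV cell center."""
    if dim % 4 != 0:
        raise ValueError("dim must be a multiple of 4 for this encoding")
    y = (row + 0.5) / height
    x = (col + 0.5) / width
    feat = []
    for i in range(dim // 4):
        freq = (2.0 ** i) * math.pi
        feat += [math.sin(freq * x), math.cos(freq * x),
                 math.sin(freq * y), math.cos(freq * y)]
    return feat


def enhance_queries(queries, heatmap, num_replace):
    """Replace the first num_replace queries with object-informed features.

    Which queries get replaced is an assumption made for this sketch.
    """
    dim = len(queries[0])
    height, width = len(heatmap), len(heatmap[0])
    locations = top_k_locations(heatmap, num_replace)
    informed = [positional_feature(r, c, height, width, dim)
                for r, c in locations]
    # The remaining predefined queries are kept unchanged.
    return informed + queries[len(informed):]


# Six predefined 8-dimensional queries and a 3x3 BEV confidence map.
queries = [[float(i)] * 8 for i in range(6)]
heatmap = [[0.90, 0.10, 0.20],
           [0.30, 0.05, 0.95],
           [0.40, 0.60, 0.00]]
enhanced = enhance_queries(queries, heatmap, num_replace=2)
```

Here the two most confident cells, `(1, 2)` and `(0, 0)`, supply the first two queries, and the other four predefined queries pass through untouched.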

OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection

Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection

QE-BEV: Query Evolution for Bird's Eye View Object Detection in Varied Contexts

ROA-BEV: 2D Region-Oriented Attention for BEV-based 3D Object

EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

PreBEV: Leveraging Predictive Flow for Enhanced Bird's-Eye View 3D Dynamic Object Detection

Enhanced 3D object detection for autonomous driving: A spatial-temporal alignment approach in Bird's Eye View scenarios

BEVHeight++: Toward Robust Visual Centric 3D Object Detection

A Streamlined Framework for BEV-Based 3D Object Detection with Prior Masking

Group Equivariant BEV for 3D Object Detection

TS-BEV: BEV object detection algorithm based on temporal-spatial feature fusion

Monocular 3D Object Detection with Motion Feature Distillation

PersDet: Monocular 3D Detection in Perspective Bird's-Eye-View

DA-BEV: Depth Aware BEV Transformer for 3D Object Detection

GeoBEV: Learning Geometric BEV Representation for Multi-view 3D Object Detection

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

SOGDet: Semantic-Occupancy Guided Multi-View 3D Object Detection