Abstract: Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems: it extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. Previous studies are limited by relying on either depth or height information alone; we find that both matter and that they are in fact complementary. Depth features carry precise geometric cues, whereas height features mainly distinguish between categories of height intervals and thus provide semantic context. This insight motivates Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel's depth and height distributions and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also integrated to further improve detection accuracy using the prior knowledge of the fusion-modal CoBEV teacher. We conduct extensive experiments on the public roadside camera-based 3D detection benchmarks DAIR-V2X-I and Rope3D, as well as on the private Supremind-Road dataset. The results demonstrate that CoBEV not only achieves new state-of-the-art accuracy, but also significantly improves on the robustness of previous methods in challenging long-distance scenarios and under noisy camera disturbances, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in easy mode. The source code will be made publicly available at CoBEV.
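The lifting and fusion steps described above can be illustrated with a minimal sketch. It is not the CoBEV implementation: the abstract does not specify the CFS module, so the function names `lift_pixel` and `blend_bev`, the scalar `gate`, and the assumption that both branches have already been pooled onto the same BEV grid are all hypothetical. The sketch only shows the general idea of weighting a pixel feature by a predicted per-bin distribution (depth or height) and then combining two BEV feature maps.

```python
import math


def softmax(logits):
    """Turn per-bin logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature, bin_logits):
    """Lift one pixel feature along its ray (hypothetical helper).

    The pixel's feature vector is weighted by the predicted probability
    of each depth (or height) bin, giving one weighted copy per bin.
    """
    probs = softmax(bin_logits)
    return [[p * f for f in feature] for p in probs]


def blend_bev(depth_bev, height_bev, gate):
    """Placeholder for feature selection: a convex blend of two BEV maps.

    Assumes both maps share the same shape (cells x channels). The real
    CFS module is two-stage and is not described in the abstract.
    """
    return [
        [gate * d + (1.0 - gate) * h for d, h in zip(d_cell, h_cell)]
        for d_cell, h_cell in zip(depth_bev, height_bev)
    ]


# Toy usage: a 2-channel pixel feature and two equally likely bins.
lifted = lift_pixel([2.0, 4.0], [0.0, 0.0])
fused = blend_bev([[1.0, 2.0]], [[3.0, 4.0]], gate=0.5)
```

Because the bin probabilities sum to one, the lifted copies of a pixel feature sum back to the original feature; a learned selection module would replace the fixed `gate` with weights predicted per cell or per channel.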

CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity
