Abstract: Roadside camera-driven 3D object detection is a crucial task in intelligent transportation systems: it extends the perception range beyond the limitations of vision-centric vehicles and enhances road safety. Whereas previous studies rely on either depth or height information alone, we find that both matter and are in fact complementary. The depth feature encodes precise geometric cues, whereas the height feature primarily distinguishes between categories of height intervals, essentially providing semantic context. This insight motivates Complementary-BEV (CoBEV), a novel end-to-end monocular 3D object detection framework that integrates depth and height to construct robust BEV representations. In essence, CoBEV estimates each pixel's depth and height distributions and lifts the camera features into 3D space for lateral fusion using the newly proposed two-stage complementary feature selection (CFS) module. A BEV feature distillation framework is also integrated to further improve detection accuracy using the prior knowledge of a fusion-modal CoBEV teacher. We conduct extensive experiments on the public roadside camera-based 3D detection benchmarks DAIR-V2X-I and Rope3D, as well as on the private Supremind-Road dataset. The results show that CoBEV not only achieves new state-of-the-art accuracy, but also significantly improves on the robustness of previous methods in challenging long-distance scenarios and under noisy camera disturbances, and enhances generalization by a large margin in heterologous settings with drastic changes in scene and camera parameters. For the first time, the vehicle AP score of a camera model reaches 80% on DAIR-V2X-I in easy mode. The source code will be made publicly available at https://github.com/MasterHow/CoBEV.
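
The lifting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the generic per-pixel outer product used in lift-splat-style pipelines: each pixel's feature vector is weighted by a softmax distribution over discrete depth (or height) bins. The function names, tensor shapes, and bin counts below are hypothetical and chosen only for illustration; this is not the CoBEV implementation, and both the voxel pooling into BEV and the CFS fusion module are omitted.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift(features, logits):
    """Weight per-pixel features by a per-pixel distribution over bins.

    features: (C, H, W) image features (hypothetical shape).
    logits:   (B, H, W) unnormalized scores over B depth or height bins.
    returns:  (C, B, H, W) features spread across the bins.
    """
    prob = softmax(logits, axis=0)           # each pixel's distribution over bins
    return features[:, None] * prob[None]    # outer product at every pixel

# Toy sizes, assumed for illustration only.
rng = np.random.default_rng(0)
C, H, W, D_BINS, H_BINS = 4, 2, 3, 5, 6
feat = rng.normal(size=(C, H, W))

depth_lifted = lift(feat, rng.normal(size=(D_BINS, H, W)))    # depth branch
height_lifted = lift(feat, rng.normal(size=(H_BINS, H, W)))   # height branch
```

Because each pixel's bin probabilities sum to one, summing a lifted tensor over its bin axis recovers the original features. The two branches differ only in what the bins represent, which is why their lifted features can afterwards be fused in a shared 3D or BEV space.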

BEVSpread: Spread Voxel Pooling for Bird’s-Eye-View Representation in Vision-based Roadside 3D Object Detection

SSF: Sparse Point Cloud Object Detection Based on Self-Adaptive Voxel Encoding and Focal-Sparse Convolution

BEVHeight: A Robust Framework for Vision-based Roadside 3D Object Detection

BEVHeight++: Toward Robust Visual Centric 3D Object Detection

SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection

Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity

OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

SGV3D: Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

BEVUDA: Multi-geometric Space Alignments for Domain Adaptive BEV 3D Object Detection

HotBEV: Hardware-oriented Transformer-based Multi-View 3D Detector for BEV Perception

BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks

Enhanced 3D object detection for autonomous driving: A spatial-temporal alignment approach in Bird's Eye View scenarios

VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention

A Versatile Multi-View Framework for LiDAR-based 3D Object Detection with Guidance from Panoptic Segmentation