Abstract: Vision-centric Bird's Eye View (BEV) perception, which covers object detection and map segmentation, provides the 3D environmental information that autonomous driving decisions rely on. Because 2D images carry no inherent depth information, converting perspective views to BEV is difficult, and camera-based BEV perception lags behind methods equipped with depth sensors. In this paper, we propose an approach that integrates depth estimation into camera-based BEV perception, using a depth estimation network to improve the 2D-to-3D feature transformation. The method consists of a depth estimation branch and a BEV perception branch. The input image is fed into a shared image encoder to extract multi-scale features. In the depth estimation branch, a depth decoder generates a depth map from these features; the depth map, together with sequential images and relative pose information, is used to compute the photometric reprojection error that supervises the branch. To address the scale ambiguity of monocular depth estimation, we incorporate ground-truth trajectory information collected by an IMU to constrain the predicted depth values, so that the predicted depth is scale-aware. In the BEV perception branch, the aforementioned multi-scale features are projected into 3D space along the perspective rays, with the assistance of depth information from the depth estimation branch. The 3D features are then collapsed along the vertical axis to generate BEV features, which, after further feature extraction, are fed into a task-specific head. Experimental results on the nuScenes dataset show that the proposed method improves BEV-based object detection and map semantic segmentation by 2.8% and 2.2%, respectively.
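
The lift-and-collapse step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. It assumes a single camera, a pinhole intrinsic matrix `K`, one metric depth value per pixel, and sum pooling within each BEV cell. The function name `lift_to_bev`, the grid ranges, and the cell size are hypothetical choices made only for illustration.

```python
import numpy as np

def lift_to_bev(feat, depth, K, x_range=(-8.0, 8.0), z_range=(0.0, 16.0), cell=1.0):
    """Scatter per-pixel features into a BEV grid using a per-pixel depth map.

    feat:  (H, W, C) image features
    depth: (H, W) metric depth along the camera z axis (assumed scale-aware)
    K:     (3, 3) pinhole intrinsics (assumed known)
    Returns a (nz, nx, C) grid in the camera frame: x right, y down, z forward.
    """
    H, W, C = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel coordinates, each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                          # perspective rays with z = 1
    pts = rays * depth.reshape(-1, 1)                        # 3D points along each ray

    nx = int(round((x_range[1] - x_range[0]) / cell))
    nz = int(round((z_range[1] - z_range[0]) / cell))
    ix = np.floor((pts[:, 0] - x_range[0]) / cell).astype(int)
    iz = np.floor((pts[:, 2] - z_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iz >= 0) & (iz < nz)     # discard points outside the grid

    bev = np.zeros((nz, nx, C), dtype=feat.dtype)
    # Ignoring the y (vertical) coordinate collapses the height axis;
    # features that fall into the same cell are summed.
    np.add.at(bev, (iz[keep], ix[keep]), feat.reshape(-1, C)[keep])
    return bev
```

In this sketch, the quality of the BEV grid depends directly on the depth map: an error in `depth` moves a feature along its ray into the wrong cell, which is why a scale-aware depth estimate matters for the view transformation.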

Depth-Assisted Camera-Based Bird's Eye View Perception for Autonomous Driving