Abstract: Visual bird's eye view (BEV) perception, due to its excellent perceptual capabilities, is progressively replacing costly LiDAR-based perception systems, especially in the realm of urban intelligent driving. However, this type of perception still relies on LiDAR data to construct ground truth databases, a process that is both cumbersome and time-consuming. Moreover, most mass-produced autonomous driving systems are only equipped with surround camera sensors and lack LiDAR data for precise annotation. To tackle this challenge, we propose a fine-tuning method for BEV perception networks based on visual 2D semantic perception, aimed at enhancing the model's generalization capabilities on new scene data. Considering the maturity and development of 2D perception technologies, our method significantly reduces the dependency on high-cost BEV ground truths and shows promising industrial application prospects. Extensive experiments and comparative analyses conducted on the nuScenes and Waymo public datasets demonstrate the effectiveness of our proposed method.

What problem does this paper attempt to address?

The paper addresses the following problem: in autonomous driving, vision-based bird's-eye-view (BEV) perception performs well, but it depends on high-cost LiDAR data to build an accurate annotation database. Building that database is cumbersome and time-consuming, and most mass-produced autonomous driving systems carry only surround-view cameras, so they have no LiDAR data for accurate annotation. To address this, the authors propose a fine-tuning method based on 2D visual semantic perception. It aims to improve how well a BEV perception model generalizes to new scene data, which reduces the dependence on high-cost BEV annotations and makes the approach attractive for industrial use. The main problems and solutions are as follows:

### 1. **Problem Description**

- **Dependence on LiDAR Data**: Current BEV perception systems usually rely on LiDAR data to generate accurate 4D annotation data, which limits their use in low-cost, large-scale production vehicles.
- **Lack of 3D Annotation Data**: Most mass-produced vehicles are equipped only with visual sensors (such as cameras) and have no LiDAR data for accurate annotation, which makes it difficult to construct high-quality training datasets.

### 2. **Solutions**

- **Fine-Tuning Method Based on 2D Vision**: The authors propose a new fine-tuning framework that uses 2D visual semantic perception to supervise the training of the BEV model. The steps are:
  - **2D Annotation**: Obtain 2D semantic information in the surround-view images, either through manual annotation or from large-scale pre-trained 2D models.
  - **3D Inference and Projection**: Use the BEV model to infer 3D perception results and project these results onto the surround-view image planes.
  - **Matching and Loss Function**: Match the projected 3D perception results with the existing 2D annotations and construct a loss function to fine-tune the parameters of the BEV model.

### 3. **Contributions**

- **Low Dependence**: The method significantly reduces the dependence on high-cost BEV annotation data and suits mass-produced vehicles equipped only with visual sensors.
- **Efficient Supervision**: A loss function is designed that matches 3D perception results with 2D annotations, improving the model's learning in complex environments.
- **Experimental Verification**: Extensive experiments on the public nuScenes and Waymo datasets are reported to verify the effectiveness of the method.

### 4. **Formula Representation**

- **Projection Formula**: A 3D detection box is projected onto the image coordinate system by

  \[ P_I = K \cdot T_{C}^{L} \cdot P_L \]

  where \( P_L \) is a point in the LiDAR coordinate system, \( T_{C}^{L} \) is the extrinsic matrix that converts LiDAR coordinates to camera coordinates, \( K \) is the camera intrinsic matrix, and \( P_I \) is the corresponding point in the image coordinate system.
- **Loss Function**: The total loss combines the classification loss, regression loss, and IoU loss:

  \[ L = \lambda_1 L_{\text{cls}} + \lambda_2 L_{\text{reg}} + \lambda_3 L_{\text{IoU}} \]

  where \( \lambda_1, \lambda_2, \lambda_3 \) are the weights of the classification, regression, and IoU losses, respectively.

In this way, the paper addresses how to train a BEV perception model effectively in the absence of LiDAR data and shows the method's potential value for autonomous driving.
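The projection, matching, and loss steps described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the axis-aligned enclosing box, the greedy IoU matcher, the 0.5 IoU threshold, and the example weights are all assumptions made for illustration, since the summary does not specify them. The demo also assumes identity extrinsics (LiDAR and camera frames coincide, with z pointing forward), which is not true of a real sensor rig.

```python
from itertools import product
from typing import List, Sequence, Tuple

Box2D = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max) in pixels


def project_points(K, T, points_lidar) -> List[Tuple[float, float]]:
    """Apply P_I = K * T * P_L per point, then divide by depth to get pixels."""
    pixels = []
    for x, y, z in points_lidar:
        p = (x, y, z, 1.0)  # homogeneous LiDAR point
        # Extrinsics: LiDAR frame -> camera frame (first three rows of the 4x4 matrix).
        cam = [sum(T[r][c] * p[c] for c in range(4)) for r in range(3)]
        if cam[2] <= 0.0:  # skip points behind the camera
            continue
        # Intrinsics: camera frame -> homogeneous image coordinates.
        uvw = [sum(K[r][c] * cam[c] for c in range(3)) for r in range(3)]
        pixels.append((uvw[0] / uvw[2], uvw[1] / uvw[2]))
    return pixels


def enclosing_box(pixels: Sequence[Tuple[float, float]]) -> Box2D:
    """Axis-aligned 2D box around the projected corners (assumed box reduction)."""
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))


def iou_2d(a: Box2D, b: Box2D) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0.0 else 0.0


def greedy_match(projected, annotated, min_iou=0.5):
    """One-to-one matching by descending IoU (a stand-in for the paper's matcher)."""
    pairs = sorted(
        ((iou_2d(p, a), i, j)
         for i, p in enumerate(projected)
         for j, a in enumerate(annotated)),
        reverse=True,
    )
    used_p, used_a, matches = set(), set(), []
    for iou, i, j in pairs:
        if iou < min_iou:
            break  # remaining pairs overlap even less
        if i in used_p or j in used_a:
            continue
        used_p.add(i)
        used_a.add(j)
        matches.append((i, j, iou))
    return matches


def total_loss(l_cls, l_reg, l_iou, weights=(1.0, 1.0, 1.0)):
    """L = lambda_1 * L_cls + lambda_2 * L_reg + lambda_3 * L_IoU."""
    w1, w2, w3 = weights
    return w1 * l_cls + w2 * l_reg + w3 * l_iou


# Demo with made-up calibration and a made-up 3D box 10-12 m in front of the camera.
K = [[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]]
T = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # identity: assumption for the demo
corners = list(product((-1.0, 1.0), (-0.5, 0.5), (10.0, 12.0)))  # 8 box corners

projected_box = enclosing_box(project_points(K, T, corners))
matches = greedy_match([projected_box], [(550.0, 310.0, 740.0, 410.0)])
# Placeholder classification and regression losses; IoU loss taken as 1 - IoU.
loss = total_loss(0.2, 0.1, 1.0 - matches[0][2], weights=(1.0, 2.0, 0.5))
```

A greedy matcher keeps the sketch short; an optimal assignment (for example, the Hungarian algorithm) would be a natural alternative when many boxes overlap. In a real fine-tuning loop the three loss terms would be differentiable tensors computed by the training framework instead of plain floats.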

Vision-Driven 2D Supervised Fine-Tuning Framework for Bird's Eye View Perception

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

Delving Into the Devils of Bird's-Eye-View Perception: A Review, Evaluation and Recipe

OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation

BEVScope: Enhancing Self-Supervised Depth Estimation Leveraging Bird's-Eye-View in Dynamic Scenarios

Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

Hierarchical and Decoupled BEV Perception Learning Framework for Autonomous Driving

Enhanced 3D object detection for autonomous driving: A spatial-temporal alignment approach in Bird's Eye View scenarios

Improving Bird's Eye View Semantic Segmentation by Task Decomposition

SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection

S2G2: Semi-Supervised Semantic Bird-Eye-View Grid-Map Generation Using a Monocular Camera for Autonomous Driving

Vision-Centric BEV Perception: A Survey

Monocular BEV Perception of Road Scenes Via Front-to-Top View Projection

BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout

BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight