Vision-Driven 2D Supervised Fine-Tuning Framework for Bird's Eye View Perception

Lei He, Qiaoyi Wang, Honglin Sun, Qing Xu, Bolin Gao, Shengbo Eben Li, Jianqiang Wang, Keqiang Li
DOI: https://doi.org/10.48550/arXiv.2409.05834
2024-09-10
Abstract: Visual bird's eye view (BEV) perception, due to its excellent perceptual capabilities, is progressively replacing costly LiDAR-based perception systems, especially in the realm of urban intelligent driving. However, this type of perception still relies on LiDAR data to construct ground truth databases, a process that is both cumbersome and time-consuming. Moreover, most mass-produced autonomous driving systems are only equipped with surround camera sensors and lack LiDAR data for precise annotation. To tackle this challenge, we propose a fine-tuning method for BEV perception networks based on visual 2D semantic perception, aimed at enhancing the model's generalization capabilities in new scene data. Considering the maturity and development of 2D perception technologies, our method significantly reduces the dependency on high-cost BEV ground truths and shows promising industrial application prospects. Extensive experiments and comparative analyses conducted on the nuScenes and Waymo public datasets demonstrate the effectiveness of our proposed method.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
Vision-based bird's eye view (BEV) perception performs well in autonomous driving, but it still depends on costly LiDAR data to build an accurately annotated ground-truth database. Building that database is cumbersome and time-consuming, and most mass-produced autonomous driving systems carry only surround-view cameras, so they have no LiDAR data to annotate from. To address this, the authors propose a fine-tuning method based on 2D visual semantic perception. It aims to improve how well a BEV perception model generalizes to new scene data while reducing the dependence on high-cost BEV annotations, which makes it attractive for industrial use. The main problems and solutions are as follows:

### 1. **Problem Description**

- **Dependence on LiDAR Data**: Current BEV perception systems usually rely on LiDAR data to generate accurate 4D annotation data, which limits their use in low-cost, large-scale production vehicles.
- **Lack of 3D Annotation Data**: Most mass-produced vehicles are equipped only with visual sensors (such as cameras) and have no LiDAR data for accurate annotation, which makes it difficult to construct high-quality training datasets.

### 2. **Solutions**

- **Fine-Tuning Method Based on 2D Vision**: The authors propose a new fine-tuning framework that uses 2D visual semantic perception to supervise the training of the BEV model. The steps are:
  - **2D Annotation**: Obtain 2D semantic information for the surround-view images, either by manual annotation or from large-scale pre-trained 2D models.
  - **3D Inference and Projection**: Use the BEV model to infer 3D perception results and project them onto the surround-view image planes.
  - **Matching and Loss Function**: Match the projected 3D perception results with the existing 2D annotations and construct a loss function to fine-tune the parameters of the BEV model.

### 3. **Contributions**

- **Low Dependence**: The method significantly reduces the dependence on high-cost BEV annotation data and suits mass-produced vehicles equipped only with visual sensors.
- **Efficient Supervision**: The loss function is designed to match 3D perception results with 2D annotations, so that 2D labels can supervise the BEV model in complex environments.
- **Experimental Verification**: Extensive experiments on the public nuScenes and Waymo datasets are reported to verify the effectiveness of the method.

### 4. **Formula Representation**

- **Projection Formula**: A point of a 3D detection box is projected into the image coordinate system by:
  \[ P_I = K \cdot T_{C}^{L} \cdot P_L \]
  where \( P_L \) is the point in the LiDAR coordinate system, \( T_{C}^{L} \) is the extrinsic matrix that converts LiDAR coordinates to camera coordinates, \( K \) is the camera intrinsic matrix, and \( P_I \) is the point in the image coordinate system.
- **Loss Function**: The total loss combines classification, regression, and IoU terms:
  \[ L = \lambda_1 L_{\text{cls}} + \lambda_2 L_{\text{reg}} + \lambda_3 L_{\text{IoU}} \]
  where \( \lambda_1, \lambda_2, \lambda_3 \) are the weights of the classification, regression, and IoU losses, respectively.

With this approach, the authors address the problem of training a BEV perception model effectively when no LiDAR data is available, and they show its potential value for autonomous driving applications.
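The projection formula, the comparison against a 2D annotation, and the weighted loss can be illustrated with a small sketch. It is not the authors' implementation. The helper names (`project_point`, `projected_bbox`, `iou_2d`, `total_loss`), the toy intrinsics, the identity extrinsic matrix, the axis-aligned box without yaw, and the placeholder loss values are all assumptions made for illustration. The summary does not state how the individual loss terms are computed, so `1 - IoU` is used here only as a common stand-in for the IoU term.

```python
from itertools import product


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def project_point(K, T_cl, p_lidar):
    """Apply P_I = K * T_C^L * P_L, then the perspective divide."""
    x, y, z, _ = matvec(T_cl, [*p_lidar, 1.0])  # LiDAR frame -> camera frame
    u, v, w = matvec(K, [x, y, z])              # camera frame -> homogeneous pixels
    if w <= 0:
        return None                             # point lies behind the camera
    return (u / w, v / w)


def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box (yaw omitted for brevity)."""
    cx, cy, cz = center
    dx, dy, dz = (s / 2 for s in size)
    return [(cx + sx * dx, cy + sy * dy, cz + sz * dz)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def projected_bbox(K, T_cl, center, size):
    """Tight 2D box (x1, y1, x2, y2) around the projected 3D corners."""
    pts = [project_point(K, T_cl, c) for c in box_corners(center, size)]
    if any(p is None for p in pts):
        return None                             # skip boxes partly behind the camera
    us, vs = zip(*pts)
    return (min(us), min(vs), max(us), max(vs))


def iou_2d(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def total_loss(l_cls, l_reg, l_iou, weights=(1.0, 1.0, 1.0)):
    """L = lambda_1 * L_cls + lambda_2 * L_reg + lambda_3 * L_IoU."""
    w1, w2, w3 = weights
    return w1 * l_cls + w2 * l_reg + w3 * l_iou


# Toy setup (assumed values): 1000 px focal length, principal point (640, 360),
# and a LiDAR frame that coincides with the camera frame (z pointing forward).
K = [[1000.0, 0.0, 640.0],
     [0.0, 1000.0, 360.0],
     [0.0, 0.0, 1.0]]
T_CL = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]

print(project_point(K, T_CL, (0.0, 0.0, 10.0)))  # (640.0, 360.0)

pred_2d = projected_bbox(K, T_CL, center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0))
annotation_2d = (520.0, 240.0, 760.0, 480.0)     # hypothetical 2D label
overlap = iou_2d(pred_2d, annotation_2d)

# Placeholder classification and regression losses; only the IoU term is computed here.
print(total_loss(l_cls=0.2, l_reg=0.1, l_iou=1.0 - overlap, weights=(1.0, 1.0, 2.0)))
```

In a real pipeline the extrinsic matrix would come from sensor calibration, boxes would carry a yaw angle, and each projected box would be matched against many candidate 2D annotations per camera before the loss is computed.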