Regulating Intermediate 3D Features for Vision-Centric Autonomous Driving

Junkai Xu, Liang Peng, Haoran Cheng, Linxuan Xia, Qi Zhou, Dan Deng, Wei Qian, Wenxiao Wang, Deng Cai
2023-12-19
Abstract: Multi-camera perception tasks have gained significant attention in the field of autonomous driving. However, existing frameworks based on Lift-Splat-Shoot (LSS) in the multi-camera setting cannot produce suitable dense 3D features due to the projection nature and uncontrollable densification process. To resolve this problem, we propose to regulate intermediate dense 3D features with the help of volume rendering. Specifically, we employ volume rendering to process the dense 3D features to obtain corresponding 2D features (e.g., depth maps, semantic maps), which are supervised by associated labels during training. This regulates the generation of dense 3D features at the feature level, providing appropriate dense and unified features for multiple perception tasks. Our approach is therefore termed Vampire, which stands for "Volume rendering As Multi-camera Perception Intermediate feature REgulator". Experimental results on the Occ3D and nuScenes datasets demonstrate that Vampire facilitates fine-grained and appropriate extraction of dense 3D features, and is competitive with existing SOTA methods across diverse downstream perception tasks such as 3D occupancy prediction, LiDAR segmentation and 3D object detection, while utilizing moderate GPU resources. We provide a video demonstration in the supplementary materials, and code is available at http://github.com/cskkxjk/Vampire.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the inability of existing Lift-Splat-Shoot (LSS) frameworks to generate suitable dense 3D features for vision-centric autonomous driving in the multi-camera setting. Because of the projection-based nature of LSS and its uncontrollable densification process, existing methods struggle to produce appropriate, unified dense 3D features, which limits their effectiveness across perception tasks.

To this end, the paper proposes Vampire (Volume rendering As Multi-camera Perception Intermediate feature REgulator). Volume rendering is applied to the dense 3D features to obtain corresponding 2D features (such as depth maps and semantic maps), and these 2D features are supervised during training, so that the generation of dense 3D features is regulated at the feature level. The method provides suitable, unified dense features for multiple downstream perception tasks, such as 3D occupancy prediction, LiDAR segmentation and 3D object detection, while requiring only moderate GPU resources.

### Main contributions of the paper:

1. **New perspective**: The paper offers a new perspective on intermediate features in vision-centric perception tasks and establishes a connection between occupancy in autonomous driving and volume density in NeRF.
2. **Vampire framework**: The paper introduces Vampire, a multi-camera perception framework whose key component is volume rendering used as a regulator of dense intermediate 3D features, so that different perception tasks can benefit from the regulated features.
3. **Multi-task processing ability**: The experiments show that the method can handle multiple perception tasks in a single forward pass and, with limited computing resources (12GB of GPU memory per device), performs comparably to existing state-of-the-art methods.
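The link between occupancy and NeRF volume density noted in the first contribution can be made concrete with the standard discrete volume-rendering quadrature from the NeRF literature. The block below is a sketch under assumptions: the symbols ($\sigma_i$ for density at the $i$-th sample along a ray $\mathbf{r}$, $\delta_i$ for the sample spacing, $t_i$ for the sample distance, $\mathbf{s}_i$ for per-sample semantic scores) are introduced here for illustration and do not come from the summary, and the paper may use a different parameterisation.

```latex
\begin{aligned}
\alpha_i &= 1 - \exp(-\sigma_i \delta_i), \\
T_i &= \prod_{j<i} (1 - \alpha_j) = \exp\Bigl(-\sum_{j<i} \sigma_j \delta_j\Bigr), \\
\hat{D}(\mathbf{r}) &= \sum_{i=1}^{N} T_i \, \alpha_i \, t_i, \qquad
\hat{S}(\mathbf{r}) = \sum_{i=1}^{N} T_i \, \alpha_i \, \mathbf{s}_i .
\end{aligned}
```

Under this reading, $\alpha_i$ plays the role of a soft occupancy for the voxel hit by sample $i$: a voxel with high density blocks the ray, so the rendered depth $\hat{D}$ and semantics $\hat{S}$ are dominated by the first occupied voxel along the ray.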
### Method overview:

- **2D to 3D conversion**: Extract 2D image features from the multi-view images and lift them into a 3D volume, producing sparse intermediate 3D features.
- **Sparse feature completion**: Use a 3D hourglass network to complete the sparse intermediate 3D features into dense intermediate 3D features.
- **Intermediate feature regulation**: Generate density volumes and semantic volumes, render them into 2D with volume rendering, and apply loss functions on the rendered outputs during training so that the dense 3D features are plausible and consistent with their 2D correspondences.

### Experimental results:

- **3D occupancy prediction**: On the Occ3D-nuScenes dataset, Vampire reaches an mIoU of 28.3, outperforming multiple baseline methods.
- **LiDAR segmentation**: On the Panoptic nuScenes validation set, Vampire reaches an mIoU of 66.4, outperforming multiple baseline methods.
- **3D object detection**: On the nuScenes validation set, Vampire reaches an mAP of 0.301 but is slightly behind some baseline methods in NDS. This may be because those baselines use additional occupancy flow annotations to improve velocity estimation.

### Conclusion:

By introducing volume rendering as a regulator of intermediate features, Vampire addresses the shortcomings of existing methods in generating dense 3D features and provides an effective solution for multi-camera perception tasks.
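As a rough illustration of the intermediate feature regulation step in the method overview, the NumPy sketch below composites per-sample density and semantic scores along a single camera ray and compares the rendered depth and semantics with 2D labels. It is not the authors' implementation: the function names `render_ray` and `regulation_loss`, the NeRF-style alpha compositing, the L1 depth loss and the softmax cross-entropy term are all assumptions chosen for clarity.

```python
import numpy as np

def render_ray(density, semantics, t_vals):
    """Composite samples along one ray (hypothetical helper, NeRF-style).

    density:   (N,) non-negative volume density at each sample
    semantics: (N, C) per-sample class scores
    t_vals:    (N,) increasing sample distances from the camera
    """
    # Spacing between consecutive samples; the last interval reuses the previous spacing.
    deltas = np.diff(t_vals, append=2.0 * t_vals[-1] - t_vals[-2])
    # Opacity of each interval: dense (occupied) samples have alpha close to 1.
    alpha = 1.0 - np.exp(-density * deltas)
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha
    depth = np.sum(weights * t_vals)   # rendered (expected) depth
    sem = weights @ semantics          # rendered semantic scores, shape (C,)
    return depth, sem, weights

def regulation_loss(depth, sem, depth_gt, label_gt):
    """Assumed 2D supervision: L1 on depth plus cross-entropy on semantics."""
    depth_loss = abs(depth - depth_gt)
    shifted = sem - sem.max()                          # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return depth_loss - log_probs[label_gt]

# Usage: a ray with a single opaque sample at t = 5 carrying class 2.
t_vals = np.linspace(1.0, 10.0, 10)
density = np.zeros(10)
density[4] = 1e3
semantics = np.zeros((10, 3))
semantics[4, 2] = 5.0
depth, sem, _ = render_ray(density, semantics, t_vals)
loss = regulation_loss(depth, sem, depth_gt=5.0, label_gt=2)
```

In a trainable version the same compositing would be written in a differentiable framework so that the 2D losses propagate back into the dense 3D features; plain NumPy is used here only to keep the arithmetic visible.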