Abstract: Dense 3D reconstruction has many applications in automated driving, including automated annotation validation, multimodal data augmentation, providing ground truth annotations for systems lacking LiDAR, and enhancing auto-labeling accuracy. LiDAR provides highly accurate but sparse depth, whereas camera images enable estimation of dense depth that is noisy, particularly at long ranges. In this paper, we harness the strengths of both sensors and propose a multimodal 3D scene reconstruction method built on a framework that combines neural implicit surfaces and radiance fields. In particular, our method estimates dense and accurate 3D structures and creates an implicit map representation based on signed distance fields, which can be further rendered into RGB images and depth maps. A mesh can be extracted from the learned signed distance field and culled based on occlusion. Dynamic objects are efficiently filtered on the fly during sampling using 3D object detection models. We demonstrate qualitative and quantitative results on challenging automotive scenes.
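The rendering and dynamic-object filtering steps from the abstract can be illustrated with a minimal, self-contained sketch. It makes several assumptions:

- A VolSDF-style Laplace-CDF mapping converts signed distance to density. This is one common way to bridge SDFs and radiance fields; the abstract does not state the paper's exact conversion.
- Axis-aligned boxes in world coordinates stand in for the detector's oriented 3D boxes.
- An analytic sphere SDF replaces the learned network.
- All names (`Box`, `sdf_to_density`, `render_depth`, `sphere_sdf`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Box:
    """Hypothetical axis-aligned box for one detected dynamic object (world frame)."""
    lo: Vec3
    hi: Vec3

    def contains(self, p: Vec3) -> bool:
        return all(l <= x <= h for x, l, h in zip(p, self.lo, self.hi))


def sdf_to_density(s: float, beta: float = 0.1) -> float:
    """VolSDF-style density (1/beta) * LaplaceCDF(-s; scale=beta); an assumption."""
    if s >= 0:
        return 0.5 * math.exp(-s / beta) / beta
    return (1.0 - 0.5 * math.exp(s / beta)) / beta


def render_depth(origin: Vec3, direction: Vec3, sdf: Callable[[Vec3], float],
                 dynamic_boxes: Sequence[Box], near: float = 0.0,
                 far: float = 10.0, n_samples: int = 256) -> Tuple[float, float]:
    """Alpha-composite the expected depth along one ray.

    Samples inside any dynamic box are skipped, so moving objects add no
    density. Returns (expected_depth, accumulated_opacity). RGB would be
    composited with the same per-sample weights, using per-sample colours.
    """
    dt = (far - near) / n_samples
    transmittance = 1.0
    depth = 0.0
    for i in range(n_samples):
        t = near + (i + 0.5) * dt
        p = tuple(o + t * d for o, d in zip(origin, direction))
        # Filter dynamic objects on the fly: these samples contribute nothing.
        if any(box.contains(p) for box in dynamic_boxes):
            continue
        alpha = 1.0 - math.exp(-sdf_to_density(sdf(p)) * dt)
        weight = transmittance * alpha
        depth += weight * t
        transmittance *= 1.0 - alpha
    return depth, 1.0 - transmittance


def sphere_sdf(p: Vec3) -> float:
    """Stand-in for the learned SDF: unit sphere centred 5 m ahead of the camera."""
    return math.dist(p, (0.0, 0.0, 5.0)) - 1.0


if __name__ == "__main__":
    ray_o, ray_d = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
    # Static scene: the ray terminates close to the sphere surface at t = 4.
    print(render_depth(ray_o, ray_d, sphere_sdf, []))
    # Same sphere flagged as a dynamic object: it is filtered out of the render.
    car = Box((-2.0, -2.0, 3.0), (2.0, 2.0, 7.0))
    print(render_depth(ray_o, ray_d, sphere_sdf, [car]))
```

Skipping samples inside detected boxes removes moving objects from the density integral. The static geometry behind them is still rendered wherever other observations constrain it.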

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **how to use neural rendering to achieve high-precision 3D reconstruction of large-scale urban scenes for autonomous driving.** Specifically, the paper combines the complementary strengths of two sensors, LiDAR and camera, to overcome the limitations of either sensor alone and generate dense, accurate 3D structure maps.

### Problem Background

1. **Limitations of Sensor Data**:
   - **LiDAR** provides highly accurate but sparse depth information.
   - **Camera images** allow dense depth estimation, but the estimates are noisy at long distances.
2. **Challenges of Existing Methods**:
   - **Temporal Consistency**: Because of odometry errors, it is difficult to keep a dense structure map spatially consistent over long sequences.
   - **Dynamic Objects**: Urban scenes contain many moving objects, which corrupt the 3D reconstruction if they are not handled.
   - **Fine-Grained Structures**: Urban scenes contain many small 3D structures (such as curbstones and telegraph poles) that are difficult to estimate accurately.

### Solutions Proposed in the Paper

The paper proposes a multimodal 3D scene reconstruction framework that combines **neural implicit surfaces** and **radiance fields** to exploit both LiDAR and camera data. Specifically:

1. **Fusion of LiDAR and Camera Data**:
   - Combine the accurate but sparse depth from LiDAR with the dense depth cues from camera images to produce a more accurate 3D structure.
2. **Neural Implicit Surface Representation**:
   - Represent the scene geometry with a **Signed Distance Function (SDF)** and regularize it with an Eikonal loss. This loss encourages the field to have unit gradient norm, i.e., to behave as a valid distance function.
3. **Dynamic Object Filtering**:
   - Use a 3D object detection model to filter out dynamic objects on the fly during sampling, so that they do not corrupt the reconstruction.
4. **Support for Large-Scale Scenes**:
   - Adopt a divide-and-conquer strategy: split a large scene into multiple sub-sequences, train a model on each sub-sequence independently, and merge the results.
5. **Supervisory Signals**:
   - Supervise training with a photometric loss, an Eikonal loss, and a geometric loss to ensure the quality of the reconstruction.

### Application Scenarios

This method can be applied to various autonomous-driving tasks, such as automated annotation validation, multimodal data augmentation, and providing ground-truth annotations for systems that lack LiDAR.

### Summary

By combining neural radiance fields and neural implicit surfaces, the paper addresses several challenges of large-scale urban scene reconstruction, notably dynamic objects and fine-grained structures. The reported experiments show that the method outperforms camera-only methods in both quantitative and qualitative evaluations, with notable improvements in PSNR and RMSE.
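The three supervisory signals listed above can be sketched as plain functions. Several details are assumptions for illustration, since the summary does not specify the exact forms:

- the squared-error photometric term;
- the L1 LiDAR depth term;
- the finite-difference gradient used for the Eikonal term;
- the weights `w_eik` and `w_geo`.

Rays without a LiDAR return are marked `None`, which mirrors the sparsity of LiDAR depth. All function names are hypothetical.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def eikonal_loss(sdf: Callable[[Vec3], float], points: Sequence[Vec3],
                 eps: float = 1e-4) -> float:
    """Mean of (||grad f(x)|| - 1)^2; gradients by central finite differences."""
    total = 0.0
    for p in points:
        grad = []
        for axis in range(3):
            plus, minus = list(p), list(p)
            plus[axis] += eps
            minus[axis] -= eps
            grad.append((sdf(tuple(plus)) - sdf(tuple(minus))) / (2.0 * eps))
        total += (math.sqrt(sum(g * g for g in grad)) - 1.0) ** 2
    return total / len(points)


def photometric_loss(pred_rgb: Sequence[Vec3], gt_rgb: Sequence[Vec3]) -> float:
    """Mean squared colour error over all rays and channels (assumed L2)."""
    errs = [(p - g) ** 2 for pr, gr in zip(pred_rgb, gt_rgb) for p, g in zip(pr, gr)]
    return sum(errs) / len(errs)


def geometric_loss(pred_depth: Sequence[float],
                   lidar_depth: Sequence[Optional[float]]) -> float:
    """Mean absolute depth error on rays that have a LiDAR return (assumed L1)."""
    pairs = [(p, g) for p, g in zip(pred_depth, lidar_depth) if g is not None]
    if not pairs:
        return 0.0
    return sum(abs(p - g) for p, g in pairs) / len(pairs)


def total_loss(pred_rgb: Sequence[Vec3], gt_rgb: Sequence[Vec3],
               pred_depth: Sequence[float], lidar_depth: Sequence[Optional[float]],
               sdf: Callable[[Vec3], float], eik_points: Sequence[Vec3],
               w_eik: float = 0.1, w_geo: float = 1.0) -> float:
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (photometric_loss(pred_rgb, gt_rgb)
            + w_eik * eikonal_loss(sdf, eik_points)
            + w_geo * geometric_loss(pred_depth, lidar_depth))
```

An exact SDF such as `lambda p: math.dist(p, (0.0, 0.0, 0.0)) - 1.0` gives an Eikonal loss near zero at points away from the centre. Scaling the same function by 2 doubles its gradient norm and gives a loss near 1, which is exactly the deviation the regularizer penalizes.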
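The divide-and-conquer step can be illustrated by splitting a drive's frame indices into overlapping windows. The function name, window length, and overlap are hypothetical. The summary does not say how the paper chooses sub-sequence boundaries, and splitting by travelled distance instead of frame count would work the same way. Shared frames between neighbouring windows give each pair of sub-models common observations, which is one plausible way to ease the final merge.

```python
from typing import List


def split_into_subsequences(n_frames: int, length: int, overlap: int) -> List[range]:
    """Split frame indices 0..n_frames-1 into windows of at most `length` frames,
    where consecutive windows share `overlap` frames (hypothetical parameters)."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    if n_frames <= 0:
        return []
    stride = length - overlap
    windows = []
    start = 0
    while True:
        end = min(start + length, n_frames)
        windows.append(range(start, end))
        if end == n_frames:
            break
        start += stride
    return windows


# Each window would be reconstructed by its own model; the sub-maps are then
# merged, for example in a shared world frame given by the vehicle poses.
for window in split_into_subsequences(n_frames=10, length=4, overlap=1):
    print(list(window))
```

With 10 frames, windows of 4, and an overlap of 1, this prints `[0, 1, 2, 3]`, `[3, 4, 5, 6]` and `[6, 7, 8, 9]`. Consecutive windows share one frame, and together they cover every frame.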

Neural Rendering based Urban Scene Reconstruction for Autonomous Driving
