Abstract: Dense 3D reconstruction has many applications in automated driving, including automated annotation validation, multimodal data augmentation, providing ground truth annotations for systems lacking LiDAR, and enhancing auto-labeling accuracy. LiDAR provides highly accurate but sparse depth, whereas camera images enable dense depth estimation that is noisy, particularly at long ranges. In this paper, we harness the strengths of both sensors and propose a multimodal 3D scene reconstruction using a framework combining neural implicit surfaces and radiance fields. In particular, our method estimates dense and accurate 3D structures and creates an implicit map representation based on signed distance fields, which can be further rendered into RGB images and depth maps. A mesh can be extracted from the learned signed distance field and culled based on occlusion. Dynamic objects are efficiently filtered on the fly during sampling using 3D object detection models. We demonstrate qualitative and quantitative results on challenging automotive scenes.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: **How can neural rendering be used to achieve high-precision 3D reconstruction of large-scale urban scenes for autonomous driving?** Specifically, the paper aims to combine the strengths of two sensors, LiDAR and camera, to overcome the limitations of each sensor on its own and generate dense, accurate 3D structure maps.
### Problem Background
1. **Limitations of Sensor Data**:
- **LiDAR** provides highly accurate but sparse depth information.
- **Camera images** can estimate dense depth, but are noisy at long distances.
2. **Challenges of Existing Methods**:
- **Temporal Consistency**: Due to odometry errors, it is difficult to maintain a spatially consistent dense structure map over a long period of time.
- **Dynamic Objects**: There are a large number of moving objects in urban scenes, which will interfere with the 3D reconstruction process.
- **Fine-Grained Structures**: Urban scenes contain many small 3D structures (such as curbstones and telegraph poles), and these structures are difficult to estimate accurately.
### Solutions Proposed in the Paper
The paper proposes a multimodal 3D scene reconstruction framework, combining **neural implicit surfaces** and **radiance fields** to fully utilize the advantages of LiDAR and camera data. Specific methods include:
1. **Fusion of LiDAR and Camera Data**:
- Use the accurate depth information provided by LiDAR, combined with the dense depth estimation of camera images, to generate a more accurate 3D structure.
2. **Neural Implicit Surface Representation**:
- Use the **Signed Distance Function (SDF)** to represent the scene geometry and regularize it through the Eikonal loss to ensure the accuracy of the geometry.
3. **Dynamic Object Filtering**:
- Using a 3D object detection model, filter dynamic objects on the fly during sampling so that they do not interfere with the reconstruction process.
4. **Support for Large-Scale Scenes**:
- Adopt a "divide-and-conquer" strategy: split a large-scale scene into multiple sub-sequences, train a model for each sub-sequence independently, and finally merge the results.
5. **Supervisory Signals**:
- Use photometric loss, Eikonal loss, and geometric loss to supervise the learning process of the model and ensure the quality of the reconstruction results.
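The Eikonal regularizer and the dynamic object filtering described above can be illustrated with a minimal, standard-library Python sketch. It is not the paper's implementation: the toy sphere SDF, the finite-difference gradient (a neural SDF would normally use automatic differentiation), the box layout `(cx, cy, cz, length, width, height, yaw)`, and the loss weights `w_geo` and `w_eik` are all assumptions made for illustration.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Toy SDF (assumption): negative inside the sphere, positive outside."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius

def sdf_gradient(sdf, p, eps=1e-4):
    """Central finite-difference gradient of the SDF at point p."""
    grad = []
    for axis in range(3):
        hi, lo = list(p), list(p)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * eps))
    return grad

def eikonal_loss(sdf, points):
    """Mean of (||grad f(x)|| - 1)^2; zero for a true distance field."""
    total = 0.0
    for p in points:
        g = sdf_gradient(sdf, p)
        norm = math.sqrt(sum(c * c for c in g))
        total += (norm - 1.0) ** 2
    return total / len(points)

def inside_box(p, box):
    """True if p lies in a yaw-rotated 3D box (hypothetical box layout)."""
    cx, cy, cz, length, width, height, yaw = box
    dx, dy, dz = p[0] - cx, p[1] - cy, p[2] - cz
    # Rotate the offset by -yaw to express it in the box frame.
    c, s = math.cos(-yaw), math.sin(-yaw)
    local_x = c * dx - s * dy
    local_y = s * dx + c * dy
    return (abs(local_x) <= length / 2
            and abs(local_y) <= width / 2
            and abs(dz) <= height / 2)

def filter_dynamic(points, boxes):
    """Drop sample points that fall inside any detected object box."""
    return [p for p in points if not any(inside_box(p, b) for b in boxes)]

def total_loss(photometric, geometric, eikonal, w_geo=1.0, w_eik=0.1):
    """Weighted sum of the three supervision terms (weights are assumptions)."""
    return photometric + w_geo * geometric + w_eik * eikonal

samples = [(2.0, 0.0, 0.0), (0.0, 3.0, 0.0), (10.0, 0.0, 0.0)]
detections = [(10.0, 0.0, 0.0, 4.0, 2.0, 2.0, 0.0)]  # one detected vehicle
static_samples = filter_dynamic(samples, detections)
print(len(static_samples), eikonal_loss(sphere_sdf, static_samples))
```

Filtering before the loss is evaluated means points on moving objects never contribute gradients, which matches the idea of discarding them during sampling rather than post-processing the reconstruction.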
### Application Scenarios
This method can be applied to various tasks in autonomous driving, such as automated annotation validation, multimodal data augmentation, and providing ground-truth annotations for systems lacking LiDAR.
### Summary
By combining neural radiance fields and neural implicit surfaces, the paper addresses several challenges in large-scale urban scene reconstruction, especially the handling of dynamic objects and fine-grained structures. The reported experiments show the method outperforming camera-only methods in both quantitative and qualitative evaluations, most notably in the PSNR and RMSE metrics.
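For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. It assumes PSNR is measured on rendered images with intensities in `[0, 1]` and RMSE on depth maps where a ground-truth value of `0` marks a pixel with no LiDAR return; the summary does not state the paper's exact evaluation protocol, so both conventions are assumptions.

```python
import math

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def depth_rmse(predicted, ground_truth):
    """Root-mean-square depth error over pixels with valid ground truth (> 0)."""
    pairs = [(p, g) for p, g in zip(predicted, ground_truth) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depth values")
    mse = sum((p - g) ** 2 for p, g in pairs) / len(pairs)
    return math.sqrt(mse)

image_score = psnr([0.5, 0.5], [0.4, 0.6])
depth_score = depth_rmse([10.0, 20.0, 5.0], [12.0, 0.0, 5.0])
print(image_score, depth_score)
```

Higher PSNR indicates rendered images closer to the reference, while lower RMSE indicates more accurate depth.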