Abstract: Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics. Recently, 3D Gaussian Splatting (3DGS) has shown promise for photorealistic and real-time NVS of static scenes. Building on 3DGS, we propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections. Our key innovation is a residual-based spherical harmonic coefficients transfer module that adapts 3DGS to varying lighting conditions and photometric post-processing. This lightweight module can be pre-computed and ensures efficient gradient propagation from rendered images to 3D Gaussian attributes. Additionally, we observe that the appearance encoder and the transient mask predictor, the two most critical parts of NVS from unconstrained photo collections, can be mutually beneficial. We introduce a plug-and-play lightweight spatial attention module to simultaneously predict transient occluders and a latent appearance representation for each image. After training and preprocessing, our method aligns with the standard 3DGS format and rendering pipeline, facilitating seamless integration into various 3DGS applications. Extensive experiments on diverse datasets show that our approach outperforms existing approaches in the rendering quality of novel view and appearance synthesis, with fast convergence and high rendering speed.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
This paper aims to solve the problems of real-scene reconstruction and novel view synthesis (NVS) from unconstrained photo collections. Specifically, the paper focuses on the following key challenges:
1. **Illumination conditions and post-processing variations**: In unconstrained photo collections, due to different shooting times and locations, illumination conditions and post-processing of photos may vary greatly. These variations have a negative impact on the quality of scene reconstruction and novel view synthesis.
2. **Prediction of transient occluders**: Images in unconstrained photo collections may contain transient occluders such as pedestrians and vehicles. Accurately predicting these transient occluders is crucial for generating high-quality novel views and appearance synthesis.
3. **Temporal-spatial efficiency**: Existing methods usually introduce additional learnable parameters and training strategies to handle unconstrained photo collections. As a result, training converges slowly and real-time rendering is out of reach.
4. **Compact data storage**: Existing methods often require a large amount of memory storage when dealing with large-scale unconstrained photo collections, which limits their wide use in practical applications.
### Solutions
To address the above challenges, the paper proposes **WE-GS** (Weighted Efficient 3D Gaussian Splatting), a point-based differentiable rendering framework for reconstructing scenes from unconstrained photo collections. The main innovations of WE-GS include:
1. **Residual-based spherical harmonic coefficient transfer module**: This module adapts to different illumination conditions and post-processing by learning image-specific residual spherical harmonic coefficients. This module is lightweight and pre-computable, ensuring efficient gradient propagation while retaining the efficiency of vanilla 3DGS.
2. **Lightweight spatial attention module**: This module simultaneously predicts transient occluder masks and latent appearance representations, improving the accuracy of transient occluder prediction and the representativeness of latent appearance representations. This design takes advantage of the mutual benefits between the appearance encoder and the transient occluder predictor.
3. **Optimization process**: Training combines multiple loss terms (such as an L1 loss, a structural similarity index (SSIM) loss, and a regularization loss) so that the model parameters are optimized efficiently and the results for novel view and appearance synthesis are of high quality.
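
The residual idea in item 1 can be illustrated with a minimal sketch. It is not the paper's implementation: the function names and the example values are hypothetical, and the residual is simply given here rather than predicted by a network. The sketch assumes the color convention of the reference 3DGS code, where the degree-0 spherical harmonic (SH) coefficient maps to RGB as `SH_C0 * c + 0.5`. It shows why such a module can be pre-computed: once the image-specific residual is added to the base coefficients, the result is an ordinary set of SH coefficients that a standard 3DGS renderer can consume.

```python
# Illustrative sketch only; names and values are hypothetical.
SH_C0 = 0.28209479177387814  # degree-0 real SH basis constant, 1 / (2 * sqrt(pi))


def apply_residual(base_sh, residual_sh):
    """Add an image-specific residual to base SH coefficients, element-wise."""
    if len(base_sh) != len(residual_sh):
        raise ValueError("base and residual must have the same length")
    return [b + r for b, r in zip(base_sh, residual_sh)]


def dc_to_rgb(dc_sh):
    """Map degree-0 SH coefficients (one per channel) to RGB, clamped to [0, 1]."""
    return [min(1.0, max(0.0, SH_C0 * c + 0.5)) for c in dc_sh]


# One Gaussian whose base color is mid grey (all DC coefficients zero).
base_dc = [0.0, 0.0, 0.0]
# Hypothetical residual for one target appearance: warmer red, cooler blue.
residual_dc = [0.5, 0.0, -0.5]

# "Baking" the residual yields plain SH coefficients in the usual 3DGS layout.
baked_dc = apply_residual(base_dc, residual_dc)
rgb = dc_to_rgb(baked_dc)
```

In this toy case the green channel stays at 0.5 while red rises and blue falls by the same amount, so the appearance change is carried entirely by the coefficients and needs no extra work at render time.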
### Experimental results
The paper reports extensive experiments on multiple datasets, including the PhotoTourism dataset and the NeRF-OSR dataset. The experimental results show that WE-GS reaches a new state-of-the-art level in terms of training speed, rendering frame rate (FPS), and the quality of novel view or novel appearance synthesis. Specifically:
- On the PhotoTourism dataset, while maintaining real-time rendering speed, WE-GS reduces the storage requirement by more than a factor of two and increases the average PSNR by 6.6 dB.
- On the NeRF-OSR dataset, WE-GS outperforms other methods in terms of metrics such as PSNR, SSIM, and LPIPS.
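
To put the 6.6 dB figure in context, the short sketch below uses the standard definition of PSNR, `10 * log10(MAX^2 / MSE)`, to convert a gain in decibels into the corresponding reduction in mean squared error. The helper names are hypothetical and the sketch assumes images normalized to `[0, 1]`. It says nothing about how the paper averages PSNR across scenes or images.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


reference_psnr = psnr(0.01)        # an MSE of 0.01 on [0, 1] images
ratio = mse_ratio_for_gain(6.6)    # MSE reduction implied by a 6.6 dB gain
```

Because PSNR is logarithmic, a 6.6 dB gain on a single image corresponds to a mean squared error roughly 4.6 times smaller.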
### Summary
By introducing the residual-based spherical harmonic coefficient transfer module and the lightweight spatial attention module, WE-GS effectively solves the problem of efficient and high-quality scene reconstruction and novel view synthesis from unconstrained photo collections. This method not only reaches a new state-of-the-art level in performance, but also performs well in terms of temporal-spatial efficiency and data storage.