Localized Gaussian Splatting Editing with Contextual Awareness

Hanyuan Xiao, Yingshu Chen, Huajian Huang, Haolin Xiong, Jing Yang, Pratusha Prasad, Yajie Zhao
2024-08-01
Abstract: Recent text-guided generation of individual 3D objects has achieved great success using diffusion priors. However, these methods are not suitable for object insertion and replacement tasks as they do not consider the background, leading to illumination mismatches within the environment. To bridge the gap, we introduce an illumination-aware 3D scene editing pipeline for the 3D Gaussian Splatting (3DGS) representation. Our key observation is that inpainting by a state-of-the-art conditional 2D diffusion model is consistent with the background in lighting. To leverage the prior knowledge from well-trained diffusion models for 3D object generation, our approach employs a coarse-to-fine object optimization pipeline with inpainted views. In the first coarse step, we achieve image-to-3D lifting given an ideal inpainted view. The process employs a 3D-aware diffusion prior from a view-conditioned diffusion model, which preserves the illumination present in the conditioning image. To acquire an ideal inpainted image, we introduce an Anchor View Proposal (AVP) algorithm to find a single view that best represents the scene illumination in the target region. In the second Texture Enhancement step, we introduce a novel Depth-guided Inpainting Score Distillation Sampling (DI-SDS), which enhances geometry and texture details with the inpainting diffusion prior, beyond the scope of the 3D-aware diffusion prior knowledge in the first coarse step. DI-SDS not only provides fine-grained texture enhancement, but also urges optimization to respect scene lighting. Our approach efficiently achieves local editing with global illumination consistency without explicitly modeling light transport. We demonstrate the robustness of our method by evaluating editing in real scenes containing explicit highlights and shadows, and compare against state-of-the-art text-to-3D editing methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses text-guided local editing of 3D scenes, in particular object replacement, such that the edited object integrates naturally into the original scene, including its lighting and occlusion effects. When existing text-to-3D methods are applied directly to scene-editing tasks such as object replacement or object insertion, they ignore global scene information, which leads to inconsistent lighting or unrealistic occlusion. To solve these problems, the authors propose a new lighting-aware 3D scene-editing pipeline based on the 3D Gaussian Splatting (3DGS) representation.

### Main Contributions

1. **Comprehensive Pipeline**: A method for generating objects from text input that can be seamlessly integrated into 3D Gaussian Splatting scenes, with emphasis on the generated objects automatically matching the global lighting.
2. **Anchor View Selection Algorithm**: The Anchor View Proposal (AVP) algorithm automatically selects the view that best represents the lighting within the target area. The selected view captures the characteristics of the scene lighting and supports harmonious object synthesis.
3. **Depth-guided Inpainting Score Distillation Sampling**: Depth-guided Inpainting Score Distillation Sampling (DI-SDS) combines geometric conditions and contextual lighting for object generation and texture enhancement.

### Method Overview

1. **Anchor View Selection**:
   - Use the capabilities of the multi-view diffusion model to select, from multiple rendered views, the anchor view containing the strongest lighting cues (such as shadows and highlights).
   - This is done by rotating the image and selecting the rotation angle that places most of the bright pixels on the left side.
   - Convert the RGB image to the HSV color space to estimate the lighting more accurately.
2. **Context-aware Coarse-to-Fine 3D Generation**:
   - **Coarse Image-to-3D Generation**: Use a pre-trained depth-conditioned diffusion model to inpaint the mask of the bounding box projection, and extract the foreground according to the text prompt as the input for the next 3D lifting step.
   - **Lighting-aware Texture Enhancement**: Starting from the coarse generation, further refine the geometric and texture details while preserving the multi-view lighting conditions.

### Experimental Results

- **Qualitative Results**: Multi-view renderings in different scenes demonstrate the detailed texture of the generated objects and their fidelity to the scene lighting.
- **Quantitative Results**: CLIP scores and user studies indicate that the generated objects are more consistent with the input text prompts and appear more realistic in the scene.

### Conclusion

This paper proposes an end-to-end pipeline for text-guided local editing of 3D scenes, especially object replacement, that achieves lighting consistency so that generated objects integrate naturally into the scene. The anchor view selection algorithm and depth-guided inpainting score distillation sampling together improve the quality and realism of the generated objects.
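
To make the anchor view selection step from the Method Overview more concrete, the following is a minimal sketch of the brightness heuristic only, under several assumptions that are not taken from the paper: the function names (`value_channel`, `left_bright_fraction`, `best_rotation`, `propose_anchor_view`) and the brightness threshold of 0.7 are hypothetical, rotations are restricted to 90-degree steps for simplicity, and the multi-view diffusion model that the summary mentions is left out entirely. It should not be read as the authors' AVP implementation.

```python
import numpy as np

def value_channel(rgb):
    """HSV value channel, V = max(R, G, B), for an RGB image scaled to [0, 1]."""
    return rgb.max(axis=-1)

def left_bright_fraction(value, threshold=0.7):
    """Fraction of bright pixels (V > threshold) lying in the left half.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    bright = value > threshold
    total = bright.sum()
    if total == 0:
        return 0.5  # no bright pixels, so no directional lighting cue
    half = value.shape[1] // 2
    return bright[:, :half].sum() / total

def best_rotation(rgb, threshold=0.7):
    """Try the four 90-degree rotations and return (k, score), where k is the
    number of counterclockwise quarter turns that puts the largest share of
    bright pixels on the left side."""
    value = value_channel(rgb)
    scores = [left_bright_fraction(np.rot90(value, k), threshold) for k in range(4)]
    k = int(np.argmax(scores))
    return k, float(scores[k])

def propose_anchor_view(views, threshold=0.7):
    """Return (view_index, k) for the view whose best rotation concentrates
    the most bright pixels on the left (a simplified stand-in for AVP)."""
    results = [best_rotation(view, threshold) for view in views]
    idx = int(np.argmax([score for _, score in results]))
    return idx, results[idx][0]
```

Given a list of rendered views as `H x W x 3` float arrays, `propose_anchor_view(views)` returns the index of the chosen view together with the number of quarter turns to apply. Using only the value channel keeps the estimate independent of hue and saturation, which is one plausible reading of why the summary mentions HSV.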