ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

Jun-Kun Chen, Samuel Rota Bulò, Norman Müller, Lorenzo Porzi, Peter Kontschieder, Yu-Xiong Wang
2024-06-14
Abstract: This paper proposes ConsistDreamer - a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency, thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models, our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically, we design surrounding views as context-rich input for the 2D diffusion model, and generate 3D-consistent, structured noise instead of image-independent noise. Moreover, we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions, particularly in complicated large-scale indoor scenes from ScanNet++, with significantly improved sharpness and fine-grained textures. Notably, ConsistDreamer stands as the first work capable of successfully editing complex (e.g., plaid/checkered) patterns. Our project page is at http://immortalco.github.io/ConsistDreamer.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the lack of 3D consistency that arises when 2D diffusion models are used for 3D scene editing. Existing 2D diffusion models may produce edits whose color and shape differ across viewpoints, and the inconsistency is most obvious in complex, large-scale indoor scenes. For example, a person may be edited to wear a red shirt in one viewpoint but a green shirt in another. When such inconsistent images are used to train Neural Radiance Fields (NeRF), the model converges to an "averaged" representation and loses most of its detail and sharpness. Regular patterns (such as grids or stripes) suffer most: because they do not line up between viewpoints, they disappear completely when lifted to 3D.

To overcome this challenge, the paper proposes the **ConsistDreamer** framework, which augments the input of the 2D diffusion model with three complementary strategies so that it becomes 3D-aware, and explicitly enforces 3D consistency during training. The three strategies are:

1. **Structured Noise**: Generate consistent noise for every viewpoint instead of sampling Gaussian noise independently per image. Gaussian noise is generated once and fixed on the surfaces of the scene objects, then rendered into each viewpoint to obtain the 2D noise used for all subsequent diffusion generations of that viewpoint's image. The denoising process therefore starts from consistent noise, which helps it end in consistent images.
2. **Surrounding Views**: Construct a composite image containing one main viewpoint and multiple reference viewpoints as the input of the 2D diffusion model. This enriches the scene context available to the model and allows multiple viewpoints to be edited simultaneously. The main viewpoint occupies a large proportion of the composite, while the reference viewpoints supply additional context that helps produce more consistent edits.
3. **Self-Supervised Consistency-Enforcing Training**: Introduce self-supervised consistency-enforcing training within each per-scene editing process. Depth-guided pixel correspondences are used to generate 3D-consistent multi-view images as self-supervised targets, and corresponding pixels are aggregated through a weighted average to ensure multi-view consistency. The training also includes a VGG perceptual loss and a stylization loss to preserve the original style and avoid over-smoothing.

With these strategies, ConsistDreamer generates highly consistent multi-view images and thereby achieves high-fidelity instruction-guided scene editing in complex, large-scale indoor scenes. Compared with existing methods, it shows significant improvements in the sharpness and detail of the editing results, and in particular it successfully edits complex patterns (such as grids or stripes).
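
The structured-noise strategy described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `render_structured_noise` and the per-pixel index maps `view_a_ids` and `view_b_ids` are hypothetical stand-ins for a real renderer that determines which surface point each pixel sees. The only idea it demonstrates is that noise is sampled once per surface point and then looked up per view, so pixels in different views that see the same point start from the same noise.

```python
import numpy as np

def render_structured_noise(surface_ids, point_noise):
    """Look up per-view 2D noise from noise fixed on surface points.

    surface_ids: (H, W) integer array; entry [y, x] is the index of the
        surface point visible at that pixel (hypothetical stand-in for a
        depth-based renderer).
    point_noise: (N, C) Gaussian noise, one row per surface point.
    Returns an (H, W, C) noise image for the view.
    """
    return point_noise[surface_ids]

rng = np.random.default_rng(0)
num_points, channels = 64, 4

# Noise is generated once on the surface and then kept fixed.
point_noise = rng.standard_normal((num_points, channels))

# Toy scene: view A sees an 8x8 grid of surface points;
# view B sees the same points shifted two pixels to the right.
view_a_ids = np.arange(num_points).reshape(8, 8)
view_b_ids = np.roll(view_a_ids, shift=2, axis=1)

noise_a = render_structured_noise(view_a_ids, point_noise)
noise_b = render_structured_noise(view_b_ids, point_noise)

# Pixels that see the same surface point share the same noise.
shared = np.array_equal(noise_b[:, 2:], noise_a[:, :-2])
```

In this toy setup `shared` is true by construction, whereas two independently sampled noise images would not agree anywhere.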
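
The weighted-average aggregation mentioned in the third strategy can be sketched in a similar way. This is an illustration under stated assumptions, not the authors' code: the function `aggregate_consistent_targets`, the `point_ids` array (assumed to come from depth-guided pixel correspondences), and the choice of weights are all hypothetical. It shows only how edited pixels that correspond to the same 3D point can be merged into a single shared target colour.

```python
import numpy as np

def aggregate_consistent_targets(colors, point_ids, weights, num_points):
    """Weighted-average edited pixel colours that map to the same 3D point.

    colors:    (P, 3) edited colours of P pixels gathered from all views.
    point_ids: (P,) index of the 3D point each pixel corresponds to
               (assumed given by depth-guided correspondences).
    weights:   (P,) non-negative per-pixel weights (hypothetical choice).
    Returns (P, 3) targets; pixels sharing a point share one colour.
    """
    weighted_sum = np.zeros((num_points, 3))
    weight_total = np.zeros(num_points)
    # Unbuffered accumulation so repeated point indices add up correctly.
    np.add.at(weighted_sum, point_ids, colors * weights[:, None])
    np.add.at(weight_total, point_ids, weights)
    mean = weighted_sum / np.maximum(weight_total, 1e-8)[:, None]
    return mean[point_ids]

# Two views disagree about point 0 (red vs. green); point 1 is seen once.
colors = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
point_ids = np.array([0, 0, 1])
weights = np.array([3.0, 1.0, 2.0])

targets = aggregate_consistent_targets(colors, point_ids, weights, num_points=2)
```

Here the two pixels that see point 0 receive the same blended colour, weighted 3:1 toward the first view, while the pixel that sees point 1 keeps its own colour. Such targets could then serve as the self-supervised signal that pulls per-view edits toward agreement.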