Abstract: This paper proposes ConsistDreamer, a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency, thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models, our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically, we design surrounding views as context-rich input for the 2D diffusion model, and generate 3D-consistent, structured noise instead of image-independent noise. Moreover, we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions, particularly in complicated large-scale indoor scenes from ScanNet++, with significantly improved sharpness and fine-grained textures. Notably, ConsistDreamer stands as the first work capable of successfully editing complex (e.g., plaid/checkered) patterns. Our project page is at http://immortalco.github.io/ConsistDreamer.

What problem does this paper attempt to address?

The paper addresses the lack of 3D consistency when 2D diffusion models are used for 3D scene editing. Existing 2D diffusion models can produce results whose color and shape differ from one viewpoint to another, and the inconsistency is most pronounced in complex, large-scale indoor scenes. For example, a person may be edited to wear a red shirt in one viewpoint but a green shirt in another. When these inconsistent images are used to train a Neural Radiance Field (NeRF), the model converges to an "averaged" representation and loses most of the detail and sharpness. Regular patterns (such as grids or stripes) suffer most: because they differ between viewpoints, they disappear entirely when lifted to 3D.

To overcome this challenge, the paper proposes the **ConsistDreamer** framework, which augments the input of the 2D diffusion model through three synergetic strategies that make it 3D-aware, and explicitly enforces 3D consistency during training. The three strategies are:

1. **Structured Noise**: Generate consistent noise for each viewpoint instead of independently sampled Gaussian noise. Gaussian noise is generated once and fixed on the surfaces of scene objects, then rendered into each viewpoint to obtain the 2D noise used for all subsequent diffusion generations of that viewpoint's image. The denoising process therefore starts from consistent noise, which helps it end at consistent images.
2. **Surrounding Views**: Construct a composite image containing one main viewpoint and multiple reference viewpoints as the input of the 2D diffusion model. This enriches the scene context and also allows multiple viewpoints to be edited simultaneously. The main viewpoint occupies most of the composite, while the reference viewpoints supply additional context that helps produce more consistent edits.
3. **Self-Supervised Consistency-Enforcing Training**: Introduce self-supervised consistency-enforcing training within each per-scene editing process. Depth-guided pixel correspondences are used to generate 3D-consistent multi-view images as self-supervised targets, and corresponding pixels are aggregated through a weighted average to ensure multi-view consistency. The training also includes a VGG perceptual loss and a stylization loss to maintain the original style and avoid over-smoothing.

Through these strategies, ConsistDreamer generates highly consistent multi-view images, thereby achieving high-fidelity instruction-guided scene editing in complex, large-scale indoor scenes. Compared with existing methods, it shows significant improvements in the sharpness and detail of the editing results, and in particular it successfully edits complex patterns (such as grids or stripes).
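
As a rough illustration of two of the ideas above (noise fixed on the surface, and weighted averaging over pixel correspondences), here is a toy NumPy sketch. It is not the paper's implementation: the 1D "surface", the `pixel_to_point` lookup tables (a stand-in for depth-guided correspondences), the function names, and the uniform weights are all assumptions made for this example. A real pipeline would render the noise through the scene geometry and would need to keep it Gaussian in every view.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "scene": a surface of N_POINTS points. Each view sees a window of them.
# pixel_to_point[v][p] is the surface point seen by pixel p of view v.
# (Hypothetical stand-in for depth-guided pixel correspondences.)
N_POINTS = 12
pixel_to_point = [np.arange(0, 8), np.arange(4, 12)]  # views overlap on points 4..7


def structured_noise(surface_noise, pix2pt):
    """Per-view noise obtained by looking up noise fixed on the surface."""
    return surface_noise[pix2pt]


def consistent_targets(views, pix2pts, weights, n_points):
    """Weighted-average per-view pixel values on the surface, then re-project."""
    acc = np.zeros(n_points)
    wsum = np.zeros(n_points)
    for img, idx, w in zip(views, pix2pts, weights):
        np.add.at(acc, idx, w * img)   # accumulate weighted values per surface point
        np.add.at(wsum, idx, w)        # accumulate weights per surface point
    fused = acc / np.maximum(wsum, 1e-8)
    return [fused[idx] for idx in pix2pts]


# Structured noise: sample once on the surface, reuse it in every view.
surface_noise = rng.standard_normal(N_POINTS)
noise_views = [structured_noise(surface_noise, idx) for idx in pixel_to_point]

# Inconsistent edits: view 0 says 1.0 ("red"), view 1 says 0.0 ("green").
edited = [np.full(8, 1.0), np.full(8, 0.0)]
weights = [np.ones(8), np.ones(8)]  # uniform weights, assumed for simplicity
targets = consistent_targets(edited, pixel_to_point, weights, N_POINTS)
```

In this toy setup, the pixels of both views that see surface points 4 to 7 receive identical noise values, and their fused targets agree on a single value in the overlap, while pixels seen by only one view keep that view's value. Plain averaging like this also shows why a perceptual or stylization loss is useful alongside it: averaging alone pulls disagreeing edits toward a blend.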

ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing
