Abstract: Text-guided diffusion models have shown superior performance in image/video generation and editing. However, few explorations have been performed in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models to achieve 3D-consistent generation. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. Such results can provide 3D priors as condition information for the following diffusion process. During denoising diffusion, we further enhance the 3D consistency by modeling cross-view correspondences with a novel two-stream (corresponding to two different views) asynchronous diffusion process. Second, we study 3D local editing and propose a two-step solution that can generate 360-degree manipulated results by editing an object from a single view. In step 1, we perform 2D local editing by blending the predicted noises. In step 2, we conduct a noise-to-text inversion process that maps the 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360-degree images can be generated. Last but not least, we extend our model to perform one-shot novel view synthesis by fine-tuning on a single image, showing for the first time the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications show the prowess of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.

What problem does this paper attempt to address?

The main problem this paper addresses is how to generate and edit realistic 3D objects with text-guided diffusion models. Specifically, the authors explore three core issues:

1. **Achieving 3D-consistent generation**: The authors combine a NeRF (Neural Radiance Field) conditional module with a two-stream asynchronous diffusion module so that the generated 3D objects are consistent across viewpoints. The NeRF module generates low-resolution coarse results that serve as conditional information, and the two-stream asynchronous diffusion module further enhances 3D consistency by modeling cross-view correspondences.
2. **3D local editing**: The authors propose a two-step method that edits a 3D object from a single view and generates 360-degree manipulated results. The first step performs 2D local editing of specific areas by blending the predicted noises; the second step maps the 2D blended noises into a view-independent text embedding space, from which 360-degree images can be generated.
3. **One-shot novel view synthesis from a single image**: The authors find that their model can be extended to one-shot novel view synthesis by fine-tuning on a single image, demonstrating the potential of text guidance for novel view synthesis.

### Formula presentation

- **NeRF implicit representation**: \[ \sigma(t), c(t) = \text{MLP}(r(t), d, y_c) \] where \(r(t) = o + td\), \(o\) and \(d\) are the origin and direction of the ray respectively, and \(y_c\) is the coarse text embedding.
- **Volume rendering**: \[ x_c(r) = \int_{t_n}^{t_f} T(t)\,\sigma(t)\,c(t)\,dt \] where \(T(t) = \exp\left(-\int_{t_n}^{t} \sigma(s)\,ds\right)\), and \(t_n\) and \(t_f\) are the near and far boundaries respectively.
- **Diffusion process**: \[ q(x_1^t \mid x_1^0) := \mathcal{N}\left(x_1^t; \sqrt{\bar{\alpha}_1^t}\,x_1^0, (1 - \bar{\alpha}_1^t) I\right) \] \[ q(x_2^t \mid x_2^0) := \mathcal{N}\left(x_2^t; \sqrt{\bar{\alpha}_2^t}\,x_2^0, (1 - \bar{\alpha}_2^t) I\right) \]
- **Posterior distribution**: \[ p_\theta(x_1^{t_1 - 1} \mid x_1^{t_1}, x_1^c, x_2^{t_2}, x_2^c, y) := \mathcal{N}\left(\mu_\theta(x_1^{t_1}, x_1^c, x_2^{t_2}, x_2^c, y), \Sigma_\theta(x_1^{t_1}, x_1^c, x_2^{t_2}, x_2^c, y)\right) \]
- **Optimization objective**: \[ L = \mathbb{E}_{y, x_1^0, x_2^0, t_1, t_2, \epsilon}\left\|\epsilon_\theta(x_1^{t_1}, x_1^c, x_2^{t_2}, x_2^c, y) - \epsilon\right\| \]

These formulas show how to generate and edit 3D objects through text-guided diffusion models and ensure their 3D consistency from different viewpoints.
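The volume rendering integral and the forward diffusion distributions listed above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it approximates the integral with the standard NeRF quadrature over evenly spaced samples, and draws noisy samples for two views at different timesteps. The function names (`render_ray`, `alpha_bar`, `noise_view`), the single-channel colour, the linear beta schedule, and the use of one shared schedule for both views are all assumptions made for the example.

```python
import math
import random


def render_ray(sigmas, colors, t_near, t_far):
    """Approximate x_c(r) = integral of T(t) * sigma(t) * c(t) dt by quadrature.

    sigmas: densities at N evenly spaced intervals covering [t_near, t_far]
    colors: scalar colours for the same intervals (one channel for brevity)
    """
    n = len(sigmas)
    delta = (t_far - t_near) / n        # width of each sample interval
    transmittance = 1.0                 # T(t_near) = exp(0) = 1
    pixel = 0.0
    for sigma, color in zip(sigmas, colors):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this interval
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha             # light surviving past it
    return pixel


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t.

    Assumes a linear beta schedule; the source does not state the schedule.
    """
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def noise_view(x0, t, rng):
    """Sample x^t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I); also return the noise."""
    abar = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


# Two views of the same object, noised at different (asynchronous) timesteps.
rng = random.Random(0)
x1_0 = [0.5, -0.2, 0.1]   # toy "image" for view 1
x2_0 = [0.4, -0.1, 0.0]   # toy "image" for view 2
x1_t, eps1 = noise_view(x1_0, 800, rng)   # heavily noised view
x2_t, eps2 = noise_view(x2_0, 200, rng)   # lightly noised view
```

In a training loop, a network \(\epsilon_\theta\) would receive both noisy views together with their coarse renderings and the text embedding, and the objective above would compare its output with the sampled noise; that network is omitted here.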

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

Enhanced 3D Generation by 2D Editing

Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Creating High-quality 3D Content by Bridging the Gap Between Text-to-2D and Text-to-3D Generation

Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Text-driven Editing of 3D Scenes without Retraining

X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

DreamDissector: Learning Disentangled Text-to-3D Generation from 2D Diffusion Priors

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation

Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data