Abstract: Text-guided diffusion models have shown superior performance in image/video generation and editing, but few explorations have been performed in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models to achieve 3D-consistent generation. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. Such results can provide 3D priors as condition information for the following diffusion process. During denoising diffusion, we further enhance the 3D consistency by modeling cross-view correspondences with a novel two-stream (corresponding to two different views) asynchronous diffusion process. Second, we study 3D local editing and propose a two-step solution that can generate 360-degree manipulated results by editing an object from a single view. In Step 1, we propose to perform 2D local editing by blending the predicted noises. In Step 2, we conduct a noise-to-text inversion process that maps 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360-degree images can be generated. Last but not least, we extend our model to perform one-shot novel view synthesis by fine-tuning on a single image, showing for the first time the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications show the prowess of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.
What problem does this paper attempt to address?
The paper addresses how to generate and edit realistic 3D objects with text-guided diffusion models. Specifically, the authors explore three core issues:
1. **Achieving 3D-consistent generation**: The authors combine a NeRF (Neural Radiance Field) conditioning module with a two-stream asynchronous diffusion module so that generated objects look consistent across viewpoints. The NeRF module renders low-resolution coarse results that serve as conditioning information, and the two-stream asynchronous diffusion module further enhances 3D consistency by modeling cross-view correspondences.
2. **3D local editing**: The authors propose a two-step method that edits a 3D object from a single view and produces 360° manipulated results. The first step performs 2D local editing of a specific region by blending predicted noises; the second step maps the blended 2D noise into a view-independent text-embedding space, from which 360° images can be generated.
3. **One-shot novel view synthesis from a single image**: The authors find that their model can be extended to one-shot novel view synthesis by fine-tuning on a single image, demonstrating the potential of text guidance for novel view synthesis.
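Step 1 of the local-editing procedure (item 2 above) can be illustrated with a small sketch. The source only states that predicted noises are blended; the mask-weighted combination below, and the names `blend_noise`, `eps_edit`, `eps_orig`, and `mask`, are assumptions made for illustration and may differ from the authors' exact formulation.

```python
import numpy as np

def blend_noise(eps_edit, eps_orig, mask):
    """Blend two predicted-noise maps with a spatial mask.

    eps_edit: noise predicted under the edited text prompt, shape (H, W, C)
    eps_orig: noise predicted under the original prompt, shape (H, W, C)
    mask:     values in [0, 1], shape (H, W); 1 marks the region to edit
    """
    eps_edit = np.asarray(eps_edit, dtype=float)
    eps_orig = np.asarray(eps_orig, dtype=float)
    m = np.asarray(mask, dtype=float)[..., None]  # broadcast over channels
    # Assumed blending rule: edited noise inside the mask, original outside.
    return m * eps_edit + (1.0 - m) * eps_orig

# Toy inputs: edit only the central 2x2 patch of a 4x4 noise map.
eps_edit = np.ones((4, 4, 3))
eps_orig = np.zeros((4, 4, 3))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
blended = blend_noise(eps_edit, eps_orig, mask)
```

With a binary mask, pixels outside the edited region keep the original noise prediction unchanged, which is what makes the edit local.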
### Formula presentation
- **NeRF implicit representation**:
\[
\sigma(t), c(t)=\text{MLP}(r(t), d, y_c)
\]
where \(r(t) = o + td\) is a point along the ray, \(o\) and \(d\) are the ray origin and direction, respectively, and \(y_c\) is the coarse-text embedding.
- **Volume rendering**:
\[
x_c(r)=\int_{t_n}^{t_f}T(t)\sigma(t)c(t)dt
\]
where \(T(t)=\exp\left(-\int_{t_n}^t\sigma(s)ds\right)\), and \(t_n\) and \(t_f\) are the near and far boundaries respectively.
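In practice the volume-rendering integral is evaluated numerically. The sketch below uses the standard NeRF quadrature, which treats \(\sigma\) and \(c\) as constant within each depth bin. The function name `render_ray`, the bin count, and the near/far values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def render_ray(sigma, color, t):
    """Approximate x_c(r) = integral of T(t) * sigma(t) * c(t) dt for one ray.

    sigma: densities per bin, shape (N,)
    color: colors per bin, shape (N, 3)
    t:     bin edges from t_n to t_f, shape (N + 1,)
    """
    sigma = np.asarray(sigma, dtype=float)
    color = np.asarray(color, dtype=float)
    t = np.asarray(t, dtype=float)
    delta = np.diff(t)                        # bin widths
    alpha = 1.0 - np.exp(-sigma * delta)      # opacity of each bin
    # Transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j)
    accum = np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]])
    trans = np.exp(-accum)
    weights = trans * alpha                   # contribution of each bin
    return weights @ color, weights

# Toy ray: uniform red fog between hypothetical bounds t_n = 2 and t_f = 6.
t = np.linspace(2.0, 6.0, 65)
sigma = np.full(64, 0.5)
color = np.tile([1.0, 0.0, 0.0], (64, 1))
pixel, weights = render_ray(sigma, color, t)
```

For constant density the weights telescope, so they sum to \(1 - \exp(-\sigma (t_f - t_n))\); the remainder is the light that passes through the volume without being absorbed.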
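- **Diffusion process**:
\[
q(x_1^{t_1}|x_1^0):=\mathcal{N}\left(x_1^{t_1}; \sqrt{\bar{\alpha}^{t_1}}x_1^0,(1 - \bar{\alpha}^{t_1})I\right)
\]
\[
q(x_2^{t_2}|x_2^0):=\mathcal{N}\left(x_2^{t_2}; \sqrt{\bar{\alpha}^{t_2}}x_2^0,(1 - \bar{\alpha}^{t_2})I\right)
\]
where \(x_1^0\) and \(x_2^0\) are the clean images of the two views, \(t_1\) and \(t_2\) are their respective timesteps, and \(\bar{\alpha}^{t}\) is the cumulative noise-schedule coefficient at step \(t\).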
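The forward process above has a closed form, so a noisy sample at any step can be drawn directly from the clean image. The following is a minimal sketch, assuming a DDPM-style linear beta schedule; the schedule values, image sizes, and the names `linear_alpha_bar` and `q_sample` are illustrative assumptions, not settings reported in the source.

```python
import numpy as np

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x^t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
# Two views of the same object, noised at different (asynchronous) steps.
x1_0 = rng.uniform(-1.0, 1.0, size=(8, 8, 3))
x2_0 = rng.uniform(-1.0, 1.0, size=(8, 8, 3))
t1, t2 = 200, 700  # hypothetical timesteps for the two streams
x1_t, eps1 = q_sample(x1_0, t1, alpha_bar, rng)
x2_t, eps2 = q_sample(x2_0, t2, alpha_bar, rng)
```

Because each view is noised independently, one stream can sit at a much cleaner step than the other, which is one plausible reading of "asynchronous" in the two-stream design.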
- **Posterior distribution**:
\[
p_\theta(x_1^{t_1 - 1}|x_1^{t_1},x_1^c,x_2^{t_2},x_2^c,y):=\mathcal{N}\left(\mu_\theta(x_1^{t_1},x_1^c,x_2^{t_2},x_2^c,y),\Sigma_\theta(x_1^{t_1},x_1^c,x_2^{t_2},x_2^c,y)\right)
\]
- **Optimization objective**:
\[
L=\mathbb{E}_{y,x_1^0,x_2^0,t_1,t_2,\epsilon}\left\|\epsilon_\theta(x_1^{t_1},x_1^c,x_2^{t_2},x_2^c,y)-\epsilon\right\|
\]
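The objective can be read as a Monte Carlo estimate: sample timesteps and noise, corrupt both views, and compare the predicted noise with the true one. The sketch below evaluates a single sample of that expectation. It assumes that \(\epsilon\) is the noise added to the first view and that \(t_1\) and \(t_2\) are drawn independently; `two_stream_loss`, `oracle`, and `zeros_model` are hypothetical names, and the toy models stand in for the real network.

```python
import numpy as np

def two_stream_loss(eps_model, x1_0, x2_0, x1_c, x2_c, y, alpha_bar, rng):
    """One-sample estimate of ||eps_theta(x1_t, x1_c, x2_t, x2_c, y) - eps||."""
    # Assumption: each stream gets its own independently sampled timestep.
    t1 = rng.integers(0, len(alpha_bar))
    t2 = rng.integers(0, len(alpha_bar))
    eps1 = rng.standard_normal(x1_0.shape)
    eps2 = rng.standard_normal(x2_0.shape)
    x1_t = np.sqrt(alpha_bar[t1]) * x1_0 + np.sqrt(1.0 - alpha_bar[t1]) * eps1
    x2_t = np.sqrt(alpha_bar[t2]) * x2_0 + np.sqrt(1.0 - alpha_bar[t2]) * eps2
    pred = eps_model(x1_t, x1_c, x2_t, x2_c, y)
    # Assumption: the target is the noise added to view 1.
    return float(np.linalg.norm(pred - eps1))

rng = np.random.default_rng(0)
alpha_bar = np.array([0.5])            # single-step toy schedule, so t1 = t2 = 0
x1_0 = rng.uniform(-1.0, 1.0, (4, 4, 3))
x2_0 = rng.uniform(-1.0, 1.0, (4, 4, 3))
x1_c = np.zeros((2, 2, 3))             # placeholder coarse NeRF renders
x2_c = np.zeros((2, 2, 3))
y = np.zeros(16)                       # placeholder text embedding

def oracle(x1_t, x1_c, x2_t, x2_c, y):
    # Inverts the forward process using the known clean image (toy check only).
    return (x1_t - np.sqrt(0.5) * x1_0) / np.sqrt(0.5)

def zeros_model(x1_t, x1_c, x2_t, x2_c, y):
    return np.zeros_like(x1_t)

loss_oracle = two_stream_loss(oracle, x1_0, x2_0, x1_c, x2_c, y, alpha_bar, rng)
loss_zero = two_stream_loss(zeros_model, x1_0, x2_0, x1_c, x2_c, y, alpha_bar, rng)
```

The norm is left unsquared to match the formula as written here; DDPM-style implementations commonly use the squared norm (mean squared error) instead.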
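Together, these formulas cover the coarse NeRF rendering, the two-stream forward noising, the learned denoising step, and the training objective that the text-guided diffusion model uses to keep generated and edited 3D objects consistent across viewpoints.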