Abstract: In recent times, automatic text-to-3D content creation has made significant progress, driven by the development of pretrained 2D diffusion models. Existing text-to-3D methods typically optimize the 3D representation to ensure that the rendered image aligns well with the given text, as evaluated by the pretrained 2D diffusion model. Nevertheless, a substantial domain gap exists between 2D images and 3D assets, primarily attributed to variations in camera-related attributes and the exclusive presence of foreground objects. Consequently, employing 2D diffusion models directly for optimizing 3D representations may lead to suboptimal outcomes. To address this issue, we present X-Dreamer, a novel approach for high-quality text-to-3D content creation that effectively bridges the gap between text-to-2D and text-to-3D synthesis. The key components of X-Dreamer are two innovative designs: Camera-Guided Low-Rank Adaptation (CG-LoRA) and Attention-Mask Alignment (AMA) Loss. CG-LoRA dynamically incorporates camera information into the pretrained diffusion models by employing camera-dependent generation for trainable parameters. This integration enhances the alignment between the generated 3D assets and the camera's perspective. AMA loss guides the attention map of the pretrained diffusion model using the binary mask of the 3D object, prioritizing the creation of the foreground object. This module ensures that the model focuses on generating accurate and detailed foreground objects. Extensive evaluations demonstrate the effectiveness of our proposed method compared to existing text-to-3D approaches. Our project webpage: https://xmu-xiaoma666.github.io/Projects/X-Dreamer/

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to create high-quality 3D content by bridging the domain gap between text-to-2D image generation (text-to-2D) and text-to-3D content generation (text-to-3D)**.

Existing text-to-3D methods usually rely on a pretrained 2D diffusion model to optimize the 3D representation so that the rendered images align with the given text. However, there is a significant domain gap between 2D images and 3D assets, mainly because of differences in camera-related attributes and because 3D assets contain only foreground objects. Directly using a 2D diffusion model to optimize a 3D representation can therefore lead to suboptimal results.

To solve this problem, the authors propose **X-Dreamer**, which bridges the gap between text-to-2D and text-to-3D generation through two designs:

1. **Camera-Guided Low-Rank Adaptation (CG-LoRA)**: camera information is dynamically injected into the pretrained 2D diffusion model, making it sensitive to camera parameters.
2. **Attention-Mask Alignment (AMA) Loss**: the binary mask of the 3D object guides the attention maps of the pretrained diffusion model, so that generation prioritizes the foreground object.

According to the authors' evaluations, these designs let X-Dreamer generate higher-quality 3D content than existing text-to-3D methods.
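
To make the CG-LoRA idea concrete, the following is a minimal sketch of the forward pass `y = xW + [x A_txt; x A_cam] B` that the summary lists for CG-LoRA. It is not the authors' implementation. The function names `matmul` and `cg_lora_forward`, the plain-list matrix layout, and the reading of `[ ; ]` as concatenation along the feature axis are assumptions made for illustration. How `A_txt` and `A_cam` are generated from the text and the camera parameters is not covered here, so they are simply passed in.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def cg_lora_forward(x, W, A_txt, A_cam, B):
    """Sketch of y = xW + [x A_txt; x A_cam] B.

    x:     (n, d)       input features
    W:     (d, d_out)   frozen pretrained weight
    A_txt: (d, r_txt)   text-dependent down-projection (assumed given)
    A_cam: (d, r_cam)   camera-dependent down-projection (assumed given)
    B:     (r_txt + r_cam, d_out) shared up-projection
    """
    base = matmul(x, W)                      # frozen path
    low_txt = matmul(x, A_txt)               # (n, r_txt)
    low_cam = matmul(x, A_cam)               # (n, r_cam)
    # Concatenate the two low-rank projections feature-wise (assumption).
    low = [t + c for t, c in zip(low_txt, low_cam)]
    update = matmul(low, B)                  # (n, d_out)
    return [[b + u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
A_txt = [[1.0], [0.0]]
A_cam = [[0.0], [1.0]]

# With B = 0 the adapted layer reproduces the frozen layer.
y_zero = cg_lora_forward(x, W, A_txt, A_cam, [[0.0, 0.0], [0.0, 0.0]])
# With a non-zero B the text and camera branches both contribute.
y_adapted = cg_lora_forward(x, W, A_txt, A_cam, [[1.0, 0.0], [0.0, 10.0]])
```

Initializing `B` to zero is the usual LoRA convention, so that training starts from the pretrained model's behavior; whether X-Dreamer follows that convention is not stated in this summary.
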
### Formula Summary

- **MSE Loss**: \[ L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( s(p_i; \Phi_{\text{dm}}) - \text{SDF}(p_i) \right)^2 \]
- **SDS Loss Gradient**: \[ \nabla_{\Phi_{\text{dm}}} L_{\text{SDS}} = \mathbb{E}_{t, \epsilon}\left[ w(t) \left( \hat{\epsilon}_{\Theta}(n_t; y, t) - \epsilon \right) \frac{\partial n}{\partial \Phi_{\text{dm}}} \right] \]
- **AMA Loss**: \[ L_{\text{AMA}} = \frac{1}{L} \sum_{i=1}^{L} \left| a_i - \eta(m) \right| \]
- **CG-LoRA Forward Propagation Formula**: \[ y = xW + [x A_{\text{txt}}; x A_{\text{cam}}] B \]

Through these formulas and designs, X-Dreamer can better handle the domain gap problem in text-to-3D generation tasks, thereby generating higher-quality 3D content.
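
As an illustration of the AMA loss formula above, here is a minimal sketch that compares each of the `L` attention maps `a_i` with a resized copy `η(m)` of the binary object mask `m`. It is not the authors' implementation. The helper names `resize_nearest` and `ama_loss`, the choice of nearest-neighbor resizing for `η`, and the per-pixel averaging of the absolute difference are all assumptions made for this example.

```python
def resize_nearest(mask, height, width):
    """Nearest-neighbor resize of a 2D mask (stand-in for eta; an assumption)."""
    src_h, src_w = len(mask), len(mask[0])
    return [[mask[r * src_h // height][c * src_w // width]
             for c in range(width)]
            for r in range(height)]


def ama_loss(attention_maps, mask):
    """Mean over layers of the mean absolute difference |a_i - eta(m)|."""
    total = 0.0
    for a in attention_maps:
        h, w = len(a), len(a[0])
        m = resize_nearest(mask, h, w)   # match the attention-map resolution
        diff = sum(abs(a[r][c] - m[r][c])
                   for r in range(h) for c in range(w))
        total += diff / (h * w)          # per-pixel average (assumption)
    return total / len(attention_maps)


# Binary mask: the object occupies the left half of a 4x4 render.
mask = [[1, 1, 0, 0] for _ in range(4)]

# One attention map already matching the mask, one uniform map.
aligned = [[1.0, 0.0], [1.0, 0.0]]
uniform = [[0.5] * 4 for _ in range(4)]

loss = ama_loss([aligned, uniform], mask)
```

The aligned map contributes zero and the uniform map contributes 0.5, so minimizing this quantity pushes attention toward the pixels covered by the object mask, which is the foreground-prioritizing behavior the summary describes.
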

X-Dreamer: Creating High-quality 3D Content by Bridging the Domain Gap Between Text-to-2D and Text-to-3D Generation
