Abstract: Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Our experiments demonstrate that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods, as demonstrated through both qualitative and quantitative evaluations on the T3Bench benchmark suite.

What problem does this paper attempt to address?

This paper addresses the over-smoothing and low-quality outputs of current Score Distillation Sampling (SDS)-based methods for text-to-3D generation. These problems stem from the mode-seeking behavior of current methods: the scores used to update the model oscillate among multiple modes, which makes optimization unstable and degrades output quality. To address this, the authors introduce an image prompt score distillation loss (ISD), which uses a reference image to guide text-to-3D optimization toward a specific mode. The ISD loss is implemented with IP-Adapter (a lightweight adapter that adds image-prompt capability to a text-to-image diffusion model) as a mode-selection module. A variant of the adapter, when not prompted by a reference image, serves as a control variate that reduces the variance of score estimates, improving output quality and optimization stability.

The specific problems and solutions proposed in the paper are:

### 1. Over-smoothing and low-quality output caused by mode-seeking behavior

**Problem**:
- Existing SDS methods are prone to over-smoothing and low-quality output when generating 3D models.
- Scores oscillate among multiple modes, resulting in unstable optimization.

**Solution**:
- Introduce the ISD loss, which uses a reference image to guide optimization toward a specific mode.
- Use IP-Adapter as a mode-selection module to stabilize optimization and improve output quality.

### 2. High variance in score estimation

**Problem**:
- High variance in score estimation degrades the quality and stability of the generated results.

**Solution**:
- Introduce an improved control variate term that reduces the variance of score estimates.
- Utilize the positive correlation between the predicted noise of IP-Adapter and the original noise to further reduce the variance.

### 3. Multi-view consistency problem (Janus problem)

**Problem**:
- Using image prompts to guide score distillation may bias the generated 3D model toward the view shown in the reference image, resulting in the Janus problem.

**Solution**:
- Combine multi-view regularization techniques (such as MVDream) to keep the generated 3D model consistent across views.
- Balance reference image alignment and multi-view consistency through joint optimization of the ISD loss and the multi-view regularization loss.

### Formula representation

The key formulas mentioned in the paper are given below. Each is the gradient of the corresponding loss with respect to the 3D parameters \(\theta\), where \(g(\theta, c)\) is the image rendered at camera pose \(c\), \(x_{t}\) is that rendering noised to timestep \(t\) with noise \(\epsilon\), \(y\) is the text prompt, \(x_{\text{ref}}\) is the reference image, and \(\omega(t)\) is a timestep weighting.

1. **SDS loss gradient**:

\[ \nabla_{\theta} L_{\text{SDS}}=\mathbb{E}_{t, \epsilon, c}\left[\omega(t)\left(\epsilon_{\phi}(x_{t}, t, y)-\epsilon\right) \frac{\partial g(\theta, c)}{\partial \theta}\right] \]

2. **ISD loss gradient**:

\[ \nabla_{\theta} L_{\text{ISD}}=\mathbb{E}_{t, \epsilon, c}\left[\omega(t)\left(\epsilon_{\text{IP}}(x_{t}, t, y, x_{\text{ref}})-\epsilon_{\text{SD}}(x_{t}, t, y)\right) \frac{\partial g(\theta, c)}{\partial \theta}\right] \]

3. **Multi-view regularization loss gradient**:

\[ \nabla_{\theta} L_{\text{SDS-MVD}}=\mathbb{E}_{t, \epsilon, c}\left[\omega(t)\left(\epsilon_{\text{MVD}}(x_{t}, t, y)-\epsilon\right) \frac{\partial g(\theta, c)}{\partial \theta}\right] \]
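
As a rough illustration of how the SDS and ISD formulas above differ, the following Python sketch computes both update directions for a single noise sample. It is a toy under loud assumptions, not the paper's implementation: images are flat lists of pixel values, the renderer \(g\) is taken to be the identity (so \(\partial g / \partial \theta\) drops out), and `toy_eps` is a hypothetical stand-in for the diffusion noise predictors that simply assumes a fixed clean image.

```python
import math
import random

def add_noise(x, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * xi + b * ei for xi, ei in zip(x, eps)]

def sds_direction(eps_pred, eps, weight=1.0):
    """w(t) * (eps_phi(x_t, t, y) - eps): prediction minus the sampled noise."""
    return [weight * (p - e) for p, e in zip(eps_pred, eps)]

def isd_direction(eps_ip, eps_sd, weight=1.0):
    """w(t) * (eps_IP(x_t, t, y, x_ref) - eps_SD(x_t, t, y)): difference of two predictions."""
    return [weight * (p - q) for p, q in zip(eps_ip, eps_sd)]

def toy_eps(x_t, alpha_bar, x0_guess):
    """Hypothetical predictor: the noise implied by x_t if the clean image were x0_guess."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - a * x0) / b for xt, x0 in zip(x_t, x0_guess)]

random.seed(0)
n_pixels = 8
x = [random.gauss(0.0, 1.0) for _ in range(n_pixels)]    # rendering g(theta, c), with g = identity
x_ref = [1.0] * n_pixels                                  # reference image (assumed)
blank = [0.0] * n_pixels                                  # toy text-only guess of the clean image
alpha_bar = 0.5                                           # assumed noise level for timestep t
eps = [random.gauss(0.0, 1.0) for _ in range(n_pixels)]  # sampled Gaussian noise
x_t = add_noise(x, eps, alpha_bar)

eps_sd = toy_eps(x_t, alpha_bar, blank)   # stands in for eps_SD(x_t, t, y)
eps_ip = toy_eps(x_t, alpha_bar, x_ref)   # stands in for eps_IP(x_t, t, y, x_ref)

g_sds = sds_direction(eps_sd, eps)        # compares a prediction with the sampled noise
g_isd = isd_direction(eps_ip, eps_sd)     # compares two predictions made on the same x_t
```

In this toy the ISD direction is the same at every pixel and points away from the reference, so a descent step on \(\theta\) moves the rendering toward `x_ref`. The sampled noise cancels between the two predictions here, which is only a property of these stand-in predictors; real noise predictors would not cancel exactly.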

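The variance-reduction claim in the summary rests on the general control-variate idea: subtracting a correlated quantity with known mean lowers the variance of an estimate without changing its expectation. The sketch below demonstrates that mechanism on scalars only. The variables `f`, `h`, and the coefficient `c` are illustrative assumptions and do not reproduce the paper's estimator or its noise predictors.

```python
import random
import statistics

random.seed(0)
n = 20000

# A shared noise source makes the estimator and the control variate positively correlated.
shared = [random.gauss(0.0, 1.0) for _ in range(n)]
f = [2.0 + s + 0.3 * random.gauss(0.0, 1.0) for s in shared]  # noisy estimator, true mean 2.0
h = [s + 0.3 * random.gauss(0.0, 1.0) for s in shared]        # control variate, known mean 0.0

# Variance-minimizing coefficient: c = Cov(f, h) / Var(h), estimated from the samples.
mean_f = statistics.fmean(f)
mean_h = statistics.fmean(h)
cov_fh = sum((a - mean_f) * (b - mean_h) for a, b in zip(f, h)) / (n - 1)
c = cov_fh / statistics.variance(h)

# Corrected estimator: subtract c * (h - E[h]); E[h] is known to be 0.0 here.
corrected = [a - c * (b - 0.0) for a, b in zip(f, h)]

var_plain = statistics.variance(f)
var_corrected = statistics.variance(corrected)
print(f"variance without control variate: {var_plain:.3f}")
print(f"variance with control variate:    {var_corrected:.3f}")
print(f"mean with control variate:        {statistics.fmean(corrected):.3f}")
```

The stronger the correlation between `f` and `h`, the larger the reduction, which is why a positively correlated noise prediction is a useful control variate in the setting the summary describes.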
ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

VividDreamer: Invariant Score Distillation For Hyper-Realistic Text-to-3D Generation

Dream-in-Style: Text-to-3D Generation using Stylized Score Distillation

Score Distillation Sampling with Learned Manifold Corrective

Taming Mode Collapse in Score Distillation for Text-to-3D Generation

ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

A Quantitative Evaluation of Score Distillation Sampling Based Text-to-3D

Diverse Score Distillation

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

Stable Score Distillation for High-Quality 3D Generation

Score Distillation via Reparametrized DDIM

StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D

Efficient Text-Guided 3D-Aware Portrait Generation with Score Distillation Sampling on Distribution

SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity

Retrieval-Augmented Score Distillation for Text-to-3D Generation

ExactDreamer: High-Fidelity Text-to-3D Content Creation via Exact Score Matching

DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation

Rethinking Score Distillation as a Bridge Between Image Distributions

BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching