ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

Uy Dieu Tran,Minh Luu,Phong Ha Nguyen,Khoi Nguyen,Binh-Son Hua
2024-11-27
Abstract: Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Our experiments demonstrate that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods, as demonstrated through both qualitative and quantitative evaluations on the T3Bench benchmark suite.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets the over-smoothing and low-quality outputs of current Score Distillation Sampling (SDS)-based text-to-3D methods. The authors attribute these issues to mode-seeking behavior: the scores used to update the 3D model oscillate among multiple modes, which destabilizes optimization and degrades output quality. To address this, they introduce an image prompt score distillation loss (ISD), which uses a reference image to steer text-to-3D optimization toward a specific mode. The ISD loss is implemented with IP-Adapter, a lightweight adapter that adds image-prompt capability to a text-to-image diffusion model, acting as a mode-selection module. The method also introduces an improved control variate term that reduces the variance of the score estimates, improving output quality and optimization stability. The specific problems and solutions proposed in the paper are as follows:

### 1. Over-smoothing and low-quality output caused by mode-seeking behavior

**Problem**:
- Existing SDS methods are prone to over-smoothing and low-quality output when generating 3D models.
- Scores oscillate among multiple modes, resulting in unstable optimization.

**Solution**:
- Introduce the ISD loss and use a reference image to guide the optimization toward a specific mode.
- Use IP-Adapter as a mode-selection module to stabilize the optimization and improve output quality.

### 2. High variance in score estimation

**Problem**:
- High variance in the score estimates degrades the quality and stability of the generated results.

**Solution**:
- Introduce an improved control variate term to reduce the variance of the score estimates.
- Exploit the positive correlation between the noise predicted by IP-Adapter and the original noise to further reduce the variance.

### 3. Multi-view consistency problem (Janus problem)

**Problem**:
- Guiding score distillation with an image prompt may bias the generated 3D model toward the view shown in the reference image, resulting in the Janus problem.

**Solution**:
- Combine the method with a multi-view regularization technique (such as MVDream) so that the generated 3D model remains consistent across views.
- Balance reference-image alignment and multi-view consistency by jointly optimizing the ISD loss and the multi-view regularization loss.

### Formula representation

The key formulas mentioned in the paper are listed below. Each is written as the gradient of the loss with respect to the 3D parameters \(\theta\), where \(g(\theta, c)\) is the image rendered from camera \(c\) and \(x_t\) is its noised version at timestep \(t\).

1. **SDS loss gradient**:
\[ \nabla_{\theta} L_{\text{SDS}}=\mathbb{E}_{t, \epsilon, c}\left[\omega(t)\left(\epsilon_{\phi}(x_{t}, t, y)-\epsilon\right) \frac{\partial g(\theta, c)}{\partial \theta}\right] \]

2. **ISD loss gradient**:
\[ \nabla_{\theta} L_{\text{ISD}}=\mathbb{E}_{t, \epsilon, c}\left[\omega(t)\left(\epsilon_{\text{IP}}(x_{t}, t, y, x_{\text{ref}})-\epsilon_{\text{SD}}(x_{t}, t, y)\right) \frac{\partial g(\theta, c)}{\partial \theta}\right] \]

3. **Multi-view regularization loss gradient**:
\[ \nabla_{\theta} L_{\text{SDS-MVD}}=\mathbb{E}_{t, \epsilon, c}\left[\omega(t)\left(\epsilon_{\text{MVD}}(x_{t}, t, y)-\epsilon\right) \frac{\partial g(\theta, c)}{\partial \theta}\right] \]
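
To make the difference between the SDS and ISD formulas above concrete, the following is a minimal NumPy sketch of the two per-sample residuals, the bracketed terms that get multiplied by \(\partial g / \partial \theta\). It is not the authors' implementation. The function `toy_predictors`, the weight `w_t`, the constant `ref_shift`, and the noise level `alpha_bar` are all hypothetical stand-ins chosen for illustration. The toy predictors are deliberately constructed so that the two predictions share their noise-dependent part, so the example only shows the form of the two residuals, not how real diffusion networks behave.

```python
import numpy as np

def sds_residual(eps_pred, eps, w_t):
    """SDS-style residual: w(t) * (eps_phi(x_t, t, y) - eps)."""
    return w_t * (eps_pred - eps)

def isd_residual(eps_ip, eps_sd, w_t):
    """ISD-style residual: w(t) * (eps_IP(x_t, t, y, x_ref) - eps_SD(x_t, t, y))."""
    return w_t * (eps_ip - eps_sd)

def toy_predictors(x_t, eps, ref_shift):
    # Hypothetical stand-ins for the two noise predictors (NOT real networks).
    # eps_sd imperfectly recovers the injected noise; eps_ip equals eps_sd plus
    # a fixed offset that plays the role of the reference-image guidance.
    eps_sd = 0.9 * eps + 0.1 * x_t
    eps_ip = eps_sd + ref_shift
    return eps_ip, eps_sd

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8))        # stand-in for a rendered image g(theta, c)
alpha_bar = 0.5                     # assumed noise level at a fixed timestep
w_t = 0.5                           # arbitrary weight standing in for omega(t)
ref_shift = np.full_like(x0, 0.05)  # assumed guidance offset

sds_samples, isd_samples = [], []
for _ in range(256):
    eps = rng.normal(size=x0.shape)
    # Standard forward-diffusion noising of the rendered image.
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    eps_ip, eps_sd = toy_predictors(x_t, eps, ref_shift)
    sds_samples.append(sds_residual(eps_sd, eps, w_t))
    isd_samples.append(isd_residual(eps_ip, eps_sd, w_t))

sds_samples = np.stack(sds_samples)
isd_samples = np.stack(isd_samples)

# Variance across noise draws, averaged over pixels.
sds_var = sds_samples.var(axis=0).mean()
isd_var = isd_samples.var(axis=0).mean()
print("SDS residual variance:", sds_var)
print("ISD residual variance:", isd_var)
```

In this toy setup the ISD residual does not depend on the sampled noise because the shared part of the two predictions cancels, whereas the SDS residual changes with every noise draw. With real networks the cancellation would only be partial; the sketch is meant to show why subtracting a correlated prediction, rather than the raw noise, can act as a control variate.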