Abstract: Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method for high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.

What problem does this paper attempt to address?

The problem this paper attempts to solve is the trade-off between quality and controllability in existing 3D content generation methods. Specifically:

1. **Methods that directly generate 3D models**: These methods offer greater controllability and editing capability, but the generated 3D objects are of lower quality because of resolution limitations.
2. **Methods that reconstruct 3D objects from multi-view images**: These methods can produce high-quality 3D objects, but they bypass the native 3D representation, which leads to geometric artifacts and limited controllability.

To address this, the paper proposes a multi-view-to-multi-view diffusion model named **Sharp-It**. The model aims to bridge the quality gap between the two families of methods by enhancing the geometric details and appearance of low-quality 3D shapes. Specifically, Sharp-It works as follows:

- **Input and output**: Sharp-It accepts a set of multi-view images rendered from a low-quality 3D object and generates high-quality multi-view images with fine geometric details and textures.
- **Diffusion model operation**: The diffusion model processes the multi-view set in parallel and shares features among the generated views, which keeps the views consistent with one another.
- **High-quality 3D reconstruction**: An existing sparse-view feed-forward reconstruction method then reconstructs a high-quality 3D model from the enhanced multi-view set.

By combining the advantages of 2D diffusion models and native 3D generative models, Sharp-It provides an efficient and controllable method for generating high-quality 3D content. According to the paper's experiments, Sharp-It outperforms existing enhancement methods in both quality and efficiency.

### Formula presentation

The formulas in the paper mainly describe the loss function used during training.
For example, the training loss \( L \) is defined as follows:

\[ L = \mathbb{E}_{t,\, \epsilon \sim \mathcal{N}(0,1)}\left[\left\| v - v_{\theta}(x_{t}, x_{\text{Shap-E}}, c_{\text{prompt}}) \right\|^{2}\right] \]

where:

- \( v_{\theta} \) is the \( v \)-prediction of the model, parameterized by \( \theta \).
- \( x_{t} \) is obtained by adding noise to \( x \), with the amount of noise depending on the diffusion time step \( t \).
- \( t \) and \( \epsilon \) are randomly sampled from the diffusion steps and from Gaussian noise, respectively.
- \( v \) is defined as \( \alpha_{t}\epsilon - \sigma_{t} x \), where \( \alpha_{t} \) and \( \sigma_{t} \) are parameters of the noise scheduler.

Minimizing this loss trains Sharp-It to produce high-quality multi-view images from low-quality ones.
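To make the \( v \)-prediction target concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It evaluates a single Monte Carlo sample of the loss rather than the full expectation. The cosine schedule in `schedule`, the forward process \( x_{t} = \alpha_{t} x + \sigma_{t}\epsilon \), the helper names, and the toy four-element "image" are all assumptions made for illustration. The network \( v_{\theta} \) and its conditioning inputs are left out entirely.

```python
import math
import random


def schedule(t):
    """Hypothetical cosine schedule (an assumption): alpha_t^2 + sigma_t^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def add_noise(x, eps, t):
    """Assumed forward process: x_t = alpha_t * x + sigma_t * eps."""
    alpha, sigma = schedule(t)
    return [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]


def v_target(x, eps, t):
    """v-prediction target from the text: v = alpha_t * eps - sigma_t * x."""
    alpha, sigma = schedule(t)
    return [alpha * ei - sigma * xi for xi, ei in zip(x, eps)]


def v_loss(v, v_pred):
    """Squared L2 distance between target and prediction for one sample."""
    return sum((a - b) ** 2 for a, b in zip(v, v_pred))


def recover_x(x_t, v, t):
    """Invert the parameterization: x = alpha_t * x_t - sigma_t * v."""
    alpha, sigma = schedule(t)
    return [alpha * xt - sigma * vi for xt, vi in zip(x_t, v)]


rng = random.Random(0)
x = [0.2, -0.5, 0.9, 0.0]                # stand-in for flattened clean pixels
eps = [rng.gauss(0.0, 1.0) for _ in x]   # Gaussian noise sample
t = rng.random()                         # diffusion time in [0, 1)

x_t = add_noise(x, eps, t)
v = v_target(x, eps, t)

# A perfect prediction gives zero loss; an all-zero prediction gives ||v||^2.
perfect_loss = v_loss(v, v)
zero_pred_loss = v_loss(v, [0.0] * len(v))

# With an exact v, the clean signal is recovered from the noisy one.
x_rec = recover_x(x_t, v, t)
```

The recovery step relies on the assumed identity \( \alpha_{t}^{2} + \sigma_{t}^{2} = 1 \). Under a schedule that does not satisfy it, `recover_x` would need to be rescaled accordingly.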

Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

Envision3D: One Image to 3D with Anchor Views Interpolation

Generative Novel View Synthesis with 3D-Aware Diffusion Models

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

V3D: Video Diffusion Models are Effective 3D Generators

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

4Diffusion: Multi-view Video Diffusion Model for 4D Generation