Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

Yiftach Edelstein, Or Patashnik, Dana Cohen-Bar, Lihi Zelnik-Manor
2024-12-04
Abstract: Advancements in text-to-image diffusion models have led to significant progress in fast 3D content creation. One common approach is to generate a set of multi-view images of an object, and then reconstruct it into a 3D model. However, this approach bypasses the use of a native 3D representation of the object and is hence prone to geometric artifacts and limited in controllability and manipulation capabilities. An alternative approach involves native 3D generative models that directly produce 3D representations. These models, however, are typically limited in their resolution, resulting in lower quality 3D objects. In this work, we bridge the quality gap between methods that directly generate 3D representations and ones that reconstruct 3D objects from multi-view images. We introduce a multi-view to multi-view diffusion model called Sharp-It, which takes a 3D consistent set of multi-view images rendered from a low-quality object and enriches its geometric details and texture. The diffusion model operates on the multi-view set in parallel, in the sense that it shares features across the generated views. A high-quality 3D model can then be reconstructed from the enriched multi-view set. By leveraging the advantages of both 2D and 3D approaches, our method offers an efficient and controllable method for high-quality 3D content creation. We demonstrate that Sharp-It enables various 3D applications, such as fast synthesis, editing, and controlled generation, while attaining high-quality assets.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is the trade-off between quality and controllability in existing 3D content generation methods. Specifically:

1. **Methods that directly generate 3D models**: These methods offer better controllability and editing capabilities, but the generated 3D objects are of lower quality because of resolution limitations.
2. **Methods that reconstruct 3D objects from multi-view images**: These methods can produce high-quality 3D objects, but they bypass a native 3D representation, which leads to geometric artifacts and limited controllability.

To address both limitations, the paper proposes a multi-view to multi-view diffusion model named **Sharp-It**. The model aims to bridge the quality gap between the two families of methods by enhancing the geometric details and appearance of low-quality 3D shapes. Specifically, Sharp-It works as follows:

- **Input and output**: Sharp-It takes a set of multi-view images rendered from a low-quality 3D object and generates high-quality multi-view images with fine geometric details and textures.
- **Diffusion model operation**: The diffusion model processes the multi-view set in parallel and shares features among the generated views to keep them consistent.
- **High-quality 3D reconstruction**: Using an existing sparse-view feed-forward reconstruction method, a high-quality 3D model can be reconstructed from the enhanced multi-view set.

By combining the advantages of 2D diffusion models and native 3D generative models, Sharp-It provides an efficient and controllable method for generating high-quality 3D content. Experimental results show that Sharp-It is superior to existing enhancement methods in both quality and efficiency.

### Formula presentation

The formulas involved in the paper are mainly used to describe the loss function during the training process.
For example, the training loss function \( L \) is defined as follows:

\[ L=\mathbb{E}_{t, \epsilon \sim \mathcal{N}(0,1)}\left[\|v - v_{\theta}(x_{t}, x_{\text{Shap-E}}, c_{\text{prompt}})\|^{2}\right] \]

where:

- \( v_{\theta} \) represents the \( v \)-prediction of the model, parameterized by \( \theta \).
- \( x_{t} \) is obtained by adding noise to \( x \), depending on the diffusion time step \( t \).
- \( t \) and \( \epsilon \) are randomly sampled from the diffusion steps and Gaussian noise respectively.
- \( v \) is defined as \( \alpha_{t}\epsilon-\sigma_{t} x \), where \( \alpha_{t} \) and \( \sigma_{t} \) are parameters of the noise scheduler.
- \( x_{\text{Shap-E}} \) and \( c_{\text{prompt}} \) are the conditioning inputs; as their subscripts indicate, these are the low-quality multi-view input and the text prompt.

Minimizing this loss trains Sharp-It to predict high-quality multi-view images from the noisy views, conditioned on the low-quality input and the prompt.
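
To make the \( v \)-prediction target concrete, here is a minimal NumPy sketch of the loss for a single sampled \( t \) and \( \epsilon \). It is an illustration under stated assumptions, not the paper's implementation: the cosine schedule (chosen so that \( \alpha_{t}^{2}+\sigma_{t}^{2}=1 \)), the forward process \( x_{t}=\alpha_{t}x+\sigma_{t}\epsilon \), the toy tensor shape, and the `dummy_model` stand-in for \( v_{\theta} \) are all hypothetical choices made for this example. The loss is computed as a mean squared error, which differs from the squared norm above only by a constant factor.

```python
import numpy as np

def cosine_schedule(t):
    """Assumed variance-preserving schedule: alpha_t^2 + sigma_t^2 = 1, t in [0, 1]."""
    alpha_t = np.cos(0.5 * np.pi * t)
    sigma_t = np.sin(0.5 * np.pi * t)
    return alpha_t, sigma_t

def add_noise(x, eps, alpha_t, sigma_t):
    """Assumed forward process: x_t = alpha_t * x + sigma_t * eps."""
    return alpha_t * x + sigma_t * eps

def v_target(x, eps, alpha_t, sigma_t):
    """v-prediction target: v = alpha_t * eps - sigma_t * x."""
    return alpha_t * eps - sigma_t * x

def v_loss(v_pred, v):
    """Mean squared error between the predicted and target v."""
    return float(np.mean((v_pred - v) ** 2))

def recover_x(x_t, v, alpha_t, sigma_t):
    """Invert the parameterization (valid when alpha_t^2 + sigma_t^2 = 1)."""
    return alpha_t * x_t - sigma_t * v

def dummy_model(x_t, x_cond, c_prompt):
    """Hypothetical stand-in for v_theta; a real model would be a diffusion network."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3, 8, 8))       # toy multi-view set (arbitrary shape)
x_cond = rng.standard_normal(x.shape)       # placeholder for the low-quality views
eps = rng.standard_normal(x.shape)          # Gaussian noise
t = 0.3                                     # one sampled diffusion time

alpha_t, sigma_t = cosine_schedule(t)
x_t = add_noise(x, eps, alpha_t, sigma_t)
v = v_target(x, eps, alpha_t, sigma_t)
loss = v_loss(dummy_model(x_t, x_cond, c_prompt=None), v)
```

Under the assumed schedule, a perfect prediction of \( v \) determines the clean views exactly, since \( \alpha_{t}x_{t}-\sigma_{t}v=x \); `recover_x` implements that identity.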