Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, Wenping Wang
Abstract: In this paper, we present a novel diffusion model called SyncDreamer that generates multiview-consistent images from a single-view image. Using pretrained large-scale 2D diffusion models, recent work Zero123 demonstrates the ability to generate plausible novel views from a single-view image of an object. However, maintaining consistency in geometry and colors for the generated images remains a challenge. To address this issue, we propose a synchronized multiview diffusion model that models the joint probability distribution of multiview images, enabling the generation of multiview-consistent images in a single reverse process. SyncDreamer synchronizes the intermediate states of all the generated images at every step of the reverse process through a 3D-aware feature attention mechanism that correlates the corresponding features across different views. Experiments show that SyncDreamer generates images with high consistency across different views, thus making it well-suited for various 3D generation tasks such as novel-view-synthesis, text-to-3D, and image-to-3D.
What problem does this paper attempt to address?
This paper addresses the problem of generating multiview-consistent images from a single-view image: given one image of an object, generate images from multiple other viewpoints that remain consistent with each other in geometry and color. The problem matters in computer vision and graphics because existing methods struggle to keep geometric structure and colors consistent across the views they generate.
### Problem Background
1. **Limited 3D Information**: Although neural networks have made great progress in extracting 3D information from images (such as Yao et al., 2018; Tewari et al., 2020), generating multiview-consistent images from a single-view image remains a challenge because a single image carries very limited 3D information.
2. **Successes and Limitations of Diffusion Models**: Diffusion models (such as Rombach et al., 2022; Ho et al., 2020) have achieved great success in 2D image generation, but directly training general-purpose 3D diffusion models usually requires a large amount of 3D data, and existing 3D datasets are not sufficient to capture the complexity of arbitrary 3D shapes.
3. **Shortcomings of Existing Methods**: Some methods generate 3D models by distilling pretrained text-to-image diffusion models, but this requires textual inversion (Gal et al., 2022), takes a long time to generate a single shape, and involves cumbersome parameter tuning. In addition, a single word embedding struggles to capture the details of an image (such as category, appearance, and pose), which degrades the quality of the reconstructed 3D shape.
### Solutions Proposed in the Paper
To overcome the above problems, the paper proposes a new framework named SyncDreamer, which generates multiview-consistent images from a single-view image. Specifically:
- **Synchronized Multiview Diffusion Model**: SyncDreamer keeps the generated multiview images consistent in geometry and color by synchronizing the intermediate states of all generated images throughout the reverse diffusion process.
- **3D-Aware Feature Attention Mechanism**: By applying a 3D-aware feature attention mechanism at every denoising step, SyncDreamer correlates the corresponding features across different views, thereby improving multiview consistency.
- **Efficient 3D Reconstruction**: The generated multiview-consistent images can be fed directly to 3D reconstruction methods such as NeRF or NeuS without special loss functions, which simplifies the 3D reconstruction process.
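
The synchronization idea in the list above can be illustrated with a small toy sketch. This is not SyncDreamer's network or its 3D-aware attention: it assumes each view's state is just a vector, uses plain softmax attention across views as a stand-in for the feature correlation, and replaces the learned denoiser with a fixed update rule. All names (`cross_view_attention`, `view_spread`, `toy_reverse_process`) and parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def cross_view_attention(states):
    """Softmax attention across views; `states` has shape (num_views, dim)."""
    scores = states @ states.T / np.sqrt(states.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ states  # each row mixes information from all views


def view_spread(states):
    """Largest pairwise distance between view states (a crude inconsistency measure)."""
    diffs = states[:, None, :] - states[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).max())


def toy_reverse_process(num_views=4, dim=8, steps=50, step_size=0.2,
                        synchronize=True, seed=0):
    """Run a toy multi-step update over all views at once.

    With synchronize=True, every step updates each view using context
    gathered from all views. With synchronize=False, each view only sees
    its own state, so this toy update leaves it unchanged.
    """
    rng = np.random.default_rng(seed)
    initial = rng.standard_normal((num_views, dim))  # start from pure noise
    states = initial.copy()
    for _ in range(steps):
        context = cross_view_attention(states) if synchronize else states
        # Toy stand-in for a denoiser: move each view toward its context.
        states = states + step_size * (context - states)
    return initial, states


if __name__ == "__main__":
    initial, synced = toy_reverse_process(synchronize=True)
    print("spread before:", view_spread(initial))
    print("spread after synchronized steps:", view_spread(synced))
```

In this toy setting, sharing information across views at every step pulls the per-view states toward each other, whereas views updated in isolation stay where they started. The actual method conditions a learned denoiser on 3D-aware features rather than averaging states.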
### Experimental Results
Experiments show that SyncDreamer outperforms the baseline methods on the Google Scanned Objects dataset, generating more consistent images and reconstructing better 3D shapes. SyncDreamer also handles 2D inputs in a variety of styles (such as cartoons, sketches, ink-wash paintings, and oil paintings), which supports its effectiveness in lifting 2D images to 3D.
In summary, SyncDreamer tackles the generation of multiview-consistent images from a single-view image by combining a synchronized multiview diffusion model with a 3D-aware feature attention mechanism, providing a new and effective tool for 3D reconstruction tasks.