Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

Yuanhao Cai, He Zhang, Kai Zhang, Yixun Liang, Mengwei Ren, Fujun Luan, Qing Liu, Soo Ye Kim, Jianming Zhang, Zhifei Zhang, Yuqian Zhou, Zhe Lin, Alan Yuille
2024-11-22
Abstract: Existing feed-forward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when the prompt view direction changes and mainly handle object-centric prompt images. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object and scene generation from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any direction, beyond object-centric inputs. In addition, to improve the capability and generalization ability of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that our method achieves better generation quality (2.20 dB higher in PSNR and 23.25 lower in FID) and over 5x faster speed (~6s on an A100 GPU) than SOTA methods. The user study and text-to-3D applications also reveal the practical value of our method. Our project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive generation results.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
The paper targets several weaknesses of existing feed-forward image-to-3D generation methods when they are given a single input view:

1. **3D Consistency Problem**: Existing 2D multi-view diffusion models cannot guarantee 3D consistency during generation, so they easily collapse when the direction of the prompt view changes. These methods mainly handle object-centric prompt images and offer little support for complex scenes.
2. **Generation Quality and Speed**: Existing methods are limited in both generation quality and speed, especially on large-scale scenes. For example, methods based on tri-plane NeRF are hard to scale to larger scenes because volume rendering is slow and resolution is limited.
3. **Data Generalization Ability**: Current methods are trained mainly on object-centric datasets, which limits how well the models generalize, particularly to large-scale scene generation.

To address these problems, the authors propose DiffusionGS, a single-stage 3D Gaussian point-cloud diffusion model that generates 3D objects and scenes from a single view. Its main features are:

- **3D Consistency**: By predicting multi-view pixel-aligned Gaussian primitives at each timestep, DiffusionGS enforces view consistency in the generated content, which enables robust generation from prompt views in any direction.
- **Fast Inference**: Using highly parallel rasterization and a scalable imaging range, DiffusionGS runs inference in approximately 6 seconds on a single A100 GPU.
- **Hybrid Training Strategy**: To improve generalization and generation quality, the authors develop a scene-object mixed training strategy that adapts to different types of 3D data by controlling the distribution of selected views, camera conditions, Gaussian point clouds, and imaging depths.
- **New Camera Pose Encoding Method**: A new camera pose encoding, reference-point Plücker coordinates (RPPC), is designed to better perceive depth and 3D geometric structure.

Experimental results show that DiffusionGS outperforms existing methods in both generation quality and speed: PSNR is 2.20 dB higher, FID is 23.25 lower, and inference is more than 5x faster.
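
The summary above names reference-point Plücker coordinates (RPPC) without defining them, so the following is only a minimal sketch of the idea, not the authors' implementation. Standard Plücker ray coordinates encode a ray with origin o and unit direction d as (o × d, d). The sketch assumes that RPPC replaces the moment vector o × d with a reference point on the ray, taken here to be the point closest to the world origin, o − (o·d)d. That choice of reference point, and the function names `plucker` and `reference_point_plucker`, are assumptions made for illustration.

```python
import numpy as np


def plucker(origins: np.ndarray, dirs: np.ndarray) -> np.ndarray:
    """Standard Plücker ray coordinates (o x d, d), one 6-vector per ray."""
    d = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit directions
    moment = np.cross(origins, d)                            # o x d
    return np.concatenate([moment, d], axis=-1)


def reference_point_plucker(origins: np.ndarray, dirs: np.ndarray) -> np.ndarray:
    """Assumed RPPC form (o - (o.d) d, d).

    The first three entries are the point on each ray closest to the world
    origin (an assumption of this sketch), the last three the unit direction.
    """
    d = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)  # unit directions
    t = np.sum(origins * d, axis=-1, keepdims=True)          # o . d per ray
    ref_point = origins - t * d                              # closest point to origin
    return np.concatenate([ref_point, d], axis=-1)


# Two parallel rays looking down +z from a camera plane at z = -2.
ray_origins = np.array([[0.0, 0.0, -2.0], [1.0, 0.0, -2.0]])
ray_dirs = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])

pl = plucker(ray_origins, ray_dirs)
rppc = reference_point_plucker(ray_origins, ray_dirs)
print(pl)
print(rppc)
```

Under this assumed form, the first three channels are a 3D position on the ray rather than a moment vector, which is one plausible reading of why such an encoding might carry depth information more directly than the standard Plücker form.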