Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views

Songchun Zhang, Chunhui Zhao
2024-12-11
Abstract: Inferring 3D structures from sparse, unposed observations is challenging due to its unconstrained nature. Recent methods propose to predict implicit representations directly from unposed inputs in a data-driven manner, achieving promising results. However, these methods do not utilize geometric priors and cannot hallucinate the appearance of unseen regions, thus making it challenging to reconstruct fine geometric and textural details. To tackle this challenge, our key idea is to reformulate this ill-posed problem as conditional novel view synthesis, aiming to generate complete observations from limited input views to facilitate reconstruction. With complete observations, the poses of the input views can be easily recovered and further used to optimize the reconstructed object. To this end, we propose a novel pipeline Pragmatist. First, we generate a complete observation of the object via a multiview conditional diffusion model. Then, we use a feed-forward large reconstruction model to obtain the reconstructed mesh. To further improve the reconstruction quality, we recover the poses of input views by inverting the obtained 3D representations and further optimize the texture using detailed input views. Unlike previous approaches, our pipeline improves reconstruction by efficiently leveraging unposed inputs and generative priors, circumventing the direct resolution of highly ill-posed problems. Extensive experiments show that our approach achieves promising performance in several benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of reconstructing high-fidelity 3D structures from sparse, unposed 2D views. Specifically, it focuses on how to infer the complete 3D structure and texture details of an object when only partial observations are available. The problem is challenging mainly because:

1. **Lack of geometric prior information**: Existing methods that predict implicit representations directly from unposed inputs do not use geometric priors and cannot hallucinate the appearance of unseen regions, so they struggle to reconstruct fine geometric and texture details.
2. **Unposed input views**: The input images come without explicit camera pose information, which makes direct reconstruction very difficult.

### Overview of the solution

To solve these problems, the authors propose a new pipeline named Pragmatist. Its core idea is to reformulate this ill-posed problem as conditional novel view synthesis: generating complete observations of the object simplifies the reconstruction process. The specific steps are as follows:

1. **Generate new views with a multiview conditional diffusion model**:
   - Use a multiview conditional diffusion model based on the self-attention mechanism to generate additional, consistent observations.
   - These new views are generated in the canonical coordinate system to ensure 3D consistency.
2. **Generate a 3D mesh with a feed-forward reconstruction model**:
   - Pass the generated multiview images through a feed-forward large reconstruction model to obtain a 3D mesh.
   - This model can efficiently extract high-quality 3D meshes from the generated images.
3. **Recover the input-view poses and refine the texture**:
   - Recover the poses of the input views by inverting the obtained 3D representation.
   - Use the high-resolution input views to further optimize the surface texture, recovering detailed geometric structure and material texture.
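The data flow of the three steps above can be sketched as a small orchestration function. This is only an illustrative skeleton, not the authors' code: `Stages`, `run_pipeline`, and the four stage callables are hypothetical names, and each stage is injected as a plain function so that any concrete diffusion model, reconstruction model, or optimizer could be plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


# Hypothetical container for the stages named in the text.
# None of these fields is an API from the paper's implementation.
@dataclass
class Stages:
    # Step 1: unposed input views -> generated views in a canonical frame.
    generate_views: Callable[[Sequence[Any]], List[Any]]
    # Step 2: generated views -> reconstructed mesh.
    reconstruct_mesh: Callable[[Sequence[Any]], Any]
    # Step 3a: mesh + original input views -> one pose per input view.
    recover_poses: Callable[[Any, Sequence[Any]], List[Any]]
    # Step 3b: mesh + input views + recovered poses -> textured mesh.
    refine_texture: Callable[[Any, Sequence[Any], Sequence[Any]], Any]


def run_pipeline(input_views: Sequence[Any], stages: Stages) -> Any:
    """Chain the stages in the order given in the text."""
    if not input_views:
        raise ValueError("at least one input view is required")
    generated = stages.generate_views(input_views)
    mesh = stages.reconstruct_mesh(generated)
    # The original inputs are reused here: first to recover their poses
    # against the mesh, then to refine its texture.
    poses = stages.recover_poses(mesh, input_views)
    return stages.refine_texture(mesh, input_views, poses)
```

Note that the input views are used twice under this reading: once as conditioning for view generation, and again after reconstruction, when their recovered poses make it possible to refine the texture.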

### Main contributions

- **Innovative problem formulation**: Recast 3D reconstruction from sparse, unposed views as a conditional novel view synthesis task, thereby generating complete observations.
- **Flexible multiview generation model**: Propose a model that can generate multiview images in the canonical coordinate system without explicit camera pose information.
- **Improved reconstruction quality**: Combining generative priors with geometric constraints improves reconstruction quality, especially for complex geometric structures and real-world textures.

### Experimental results

The reported experiments show that Pragmatist achieves significantly better results than existing methods on multiple benchmarks, especially with sparse and unposed input views, with higher reconstruction accuracy and better visual quality.

### Formula presentation

Some of the formulas involved in the article include:

- Probability distribution of conditional novel view generation: \[ I_{\text{tgt}} \sim p(I_{\text{tgt}} \mid I_{\text{cond}}, P_{\text{cond}}, P_{\text{tgt}}) \]
- Volume-rendering training loss: \[ L_{\text{vol}} = L_{\text{rgb}} + \lambda_p L_{\text{LPIPS}} \]
- Surface-rendering training loss: \[ L_{\text{geo}} = L_{\text{rgb}} + \lambda_p L_{\text{LPIPS}} + \lambda_d L_{\text{d}} + \lambda_m L_{\text{m}} + \lambda_r L_{\text{o}} \]

These formulas describe the key computations in the model.
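
As a small illustration of how the two training losses above combine their terms, the following sketch computes them as weighted sums of already-evaluated scalar losses. The function names and the example weights are assumptions made for illustration. The summary does not define the individual terms \(L_{\text{d}}\), \(L_{\text{m}}\), and \(L_{\text{o}}\), so they are treated here as opaque numbers.

```python
def volume_loss(*, l_rgb: float, l_lpips: float, lambda_p: float) -> float:
    """L_vol = L_rgb + lambda_p * L_LPIPS."""
    return l_rgb + lambda_p * l_lpips


def surface_loss(
    *,
    l_rgb: float,
    l_lpips: float,
    l_d: float,
    l_m: float,
    l_o: float,
    lambda_p: float,
    lambda_d: float,
    lambda_m: float,
    lambda_r: float,
) -> float:
    """L_geo = L_rgb + lambda_p*L_LPIPS + lambda_d*L_d + lambda_m*L_m + lambda_r*L_o."""
    # The first two terms are exactly the volume-rendering loss.
    photometric = volume_loss(l_rgb=l_rgb, l_lpips=l_lpips, lambda_p=lambda_p)
    return photometric + lambda_d * l_d + lambda_m * l_m + lambda_r * l_o


# Example with made-up loss values and weights (not taken from the paper).
example = surface_loss(
    l_rgb=1.0, l_lpips=2.0, l_d=4.0, l_m=2.0, l_o=1.0,
    lambda_p=0.5, lambda_d=0.25, lambda_m=0.5, lambda_r=2.0,
)
```

Written this way, the surface-rendering loss is visibly the volume-rendering loss plus three extra weighted terms, which matches the structure of the two formulas.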