Abstract: Inferring 3D structures from sparse, unposed observations is challenging due to its unconstrained nature. Recent methods propose to predict implicit representations directly from unposed inputs in a data-driven manner, achieving promising results. However, these methods do not utilize geometric priors and cannot hallucinate the appearance of unseen regions, thus making it challenging to reconstruct fine geometric and textural details. To tackle this challenge, our key idea is to reformulate this ill-posed problem as conditional novel view synthesis, aiming to generate complete observations from limited input views to facilitate reconstruction. With complete observations, the poses of the input views can be easily recovered and further used to optimize the reconstructed object. To this end, we propose a novel pipeline Pragmatist. First, we generate a complete observation of the object via a multiview conditional diffusion model. Then, we use a feed-forward large reconstruction model to obtain the reconstructed mesh. To further improve the reconstruction quality, we recover the poses of input views by inverting the obtained 3D representations and further optimize the texture using detailed input views. Unlike previous approaches, our pipeline improves reconstruction by efficiently leveraging unposed inputs and generative priors, circumventing the direct resolution of highly ill-posed problems. Extensive experiments show that our approach achieves promising performance in several benchmarks.

What problem does this paper attempt to address?

This paper addresses the problem of reconstructing high-fidelity 3D structures from sparse, unposed 2D views. Specifically, it focuses on how to infer an object's complete 3D structure and texture details when only partial observations are available. The problem is challenging for two main reasons:

1. **Lack of geometric priors**: Recent data-driven methods that predict implicit representations directly from unposed inputs do not use geometric priors and cannot hallucinate the appearance of unseen regions, so they struggle to reconstruct fine geometric and textural details.
2. **Unposed input views**: The input images carry no explicit camera pose information, which leaves direct reconstruction highly under-constrained.

### Overview of the solution

To solve these problems, the authors propose a pipeline named Pragmatist. Its core idea is to reformulate this ill-posed problem as conditional novel view synthesis: once a complete set of observations has been generated, reconstruction becomes much simpler. The pipeline has three steps:

1. **Generate new views with a multi-view conditional diffusion model**:
   - Use a multi-view conditional diffusion model based on the self-attention mechanism to generate additional, mutually consistent views.
   - Generate these views in a canonical coordinate system to ensure 3D consistency.
2. **Generate a 3D mesh with a feed-forward reconstruction model**:
   - Feed the generated multi-view images to a feed-forward large reconstruction model to obtain a 3D mesh.
   - This model extracts the mesh from the generated images in a single forward pass.
3. **Recover the input-view poses and optimize the texture**:
   - Recover the poses of the input views by inverting the obtained 3D representation.
   - Use the detailed input views to further optimize the texture of the reconstructed object.
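The pose-recovery idea in step 3 (inverting a 3D representation to find the camera that explains an input view) can be illustrated with a deliberately small toy. The sketch below is not the paper's method: it assumes a handful of known 3D points as the "reconstruction", an orthographic camera, a single unknown azimuth, known point correspondences, and a brute-force grid search. All function and variable names are hypothetical.

```python
import math

def rotate_y(point, azimuth):
    """Rotate a 3D point about the vertical (y) axis by `azimuth` radians."""
    x, y, z = point
    c, s = math.cos(azimuth), math.sin(azimuth)
    return (c * x + s * z, y, -s * x + c * z)

def project(points, azimuth):
    """Orthographic 'rendering': rotate the points, then drop the depth (z) axis."""
    return [rotate_y(p, azimuth)[:2] for p in points]

def reprojection_error(points, observed, azimuth):
    """Mean squared 2D distance between the rendering at `azimuth` and the observation."""
    rendered = project(points, azimuth)
    total = sum((u - ou) ** 2 + (v - ov) ** 2
                for (u, v), (ou, ov) in zip(rendered, observed))
    return total / len(points)

def recover_azimuth(points, observed, steps=3600):
    """Grid search over candidate azimuths; keep the one that best explains the view."""
    candidates = [2 * math.pi * i / steps for i in range(steps)]
    return min(candidates, key=lambda a: reprojection_error(points, observed, a))

# Toy stand-in for the reconstructed object (asymmetric, so the pose is unambiguous).
RECONSTRUCTION = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5), (-0.3, 0.2, 1.0), (0.4, -0.7, -0.2)]

# Simulate an "unposed" input view taken from an azimuth the search does not know.
TRUE_AZIMUTH = math.radians(40.0)
OBSERVED = project(RECONSTRUCTION, TRUE_AZIMUTH)

ESTIMATE = recover_azimuth(RECONSTRUCTION, OBSERVED)
print(f"estimated azimuth: {math.degrees(ESTIMATE):.1f} degrees")
```

A real system would compare rendered images rather than matched points and would optimize a full camera pose, typically with gradient-based optimization instead of a grid search. The toy only shows why a complete 3D representation makes the pose of an input view recoverable.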
### Main contributions

- **Problem reformulation**: Recast 3D reconstruction from sparse, unposed views as a conditional novel view synthesis task that generates a complete set of observations.
- **Flexible multi-view generation model**: Propose a model that generates multi-view images in a canonical coordinate system without explicit camera pose information.
- **Improved reconstruction quality**: Combine generative priors with geometric constraints to improve reconstruction quality, especially for complex geometry and real-world textures.

### Experimental results

The experiments report promising performance on several benchmarks, with higher reconstruction accuracy and better visual quality than existing methods on sparse, unposed input views.

### Formula presentation

Some of the formulas involved in the article include:

- Probability distribution for conditional novel view generation:
  \[
  I_{\text{tgt}} \sim p(I_{\text{tgt}} \mid I_{\text{cond}}, P_{\text{cond}}, P_{\text{tgt}})
  \]
- Volume-rendering training loss:
  \[
  L_{\text{vol}} = L_{\text{rgb}} + \lambda_p L_{\text{LPIPS}}
  \]
- Surface-rendering training loss:
  \[
  L_{\text{geo}} = L_{\text{rgb}} + \lambda_p L_{\text{LPIPS}} + \lambda_d L_{\text{d}} + \lambda_m L_{\text{m}} + \lambda_r L_{\text{o}}
  \]

The first formula states the view-generation objective: a target view is sampled conditioned on the input views and the corresponding poses. The other two are the weighted-sum losses used when training with volume rendering and with surface rendering, respectively.
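Both training losses above are plain weighted sums, and the surface-rendering loss extends the volume-rendering loss with three extra terms. A minimal sketch of that composition follows, assuming each individual term has already been computed as a scalar. The function names are hypothetical, the weights are passed in by the caller because the summary gives no values for them, and the terms `l_d`, `l_m`, and `l_o` are kept abstract because the summary does not define them.

```python
def volume_loss(l_rgb, l_lpips, *, lambda_p):
    """L_vol = L_rgb + lambda_p * L_LPIPS."""
    return l_rgb + lambda_p * l_lpips

def surface_loss(l_rgb, l_lpips, l_d, l_m, l_o, *,
                 lambda_p, lambda_d, lambda_m, lambda_r):
    """L_geo = L_vol + lambda_d * L_d + lambda_m * L_m + lambda_r * L_o."""
    base = volume_loss(l_rgb, l_lpips, lambda_p=lambda_p)
    return base + lambda_d * l_d + lambda_m * l_m + lambda_r * l_o

# Illustrative numbers only; the weights here are placeholders, not the paper's values.
example_vol = volume_loss(0.5, 0.25, lambda_p=2.0)
example_geo = surface_loss(0.5, 0.25, 0.1, 0.2, 0.3,
                           lambda_p=2.0, lambda_d=1.0, lambda_m=1.0, lambda_r=1.0)
print(example_vol, example_geo)
```

Making the weights keyword-only keeps each coefficient visibly paired with its term at the call site, which helps when several similarly named weights are tuned together.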

Pragmatist: Multiview Conditional Diffusion Models for High-Fidelity 3D Reconstruction from Unposed Sparse Views

SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

VI3DRM: Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis

Data-Driven 3D Reconstruction of Dressed Humans From Sparse Views

DreamSparse: Escaping from Plato's Cave with 2D Frozen Diffusion Model Given Sparse Views

How to Use Diffusion Priors under Sparse Views?

Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

The More You See in 2D, the More You Perceive in 3D

MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

Explicit 3D Reconstruction from Images with Dynamic Graph Learning and Rendering-Guided Diffusion

Gaussian Scenes: Pose-Free Sparse-View Scene Reconstruction using Depth-Enhanced Diffusion Priors

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis

ReconFusion: 3D Reconstruction with Diffusion Priors

Exploiting Priors from 3D Diffusion Models for RGB-Based One-Shot View Planning

SpaRP: Fast 3D Object Reconstruction and Pose Estimation from Sparse Views

Deceptive-NeRF/3DGS: Diffusion-Generated Pseudo-Observations for High-Quality Sparse-View Reconstruction

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis