2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

Yizheng Chen, Rengan Xie, Qi Ye, Sen Yang, Zixuan Xie, Tianxiao Chen, Rong Li, Yuchi Huo
2024-01-29
Abstract: Reconstructing 3D objects from a single image is an intriguing but challenging problem. One promising solution is to utilize multi-view (MV) 3D reconstruction to fuse generated MV images into consistent 3D objects. However, the generated images usually suffer from inconsistent lighting, misaligned geometry, and sparse views, leading to poor reconstruction quality. To cope with these problems, we present a novel 3D reconstruction framework that leverages intrinsic decomposition guidance, transient-mono prior guidance, and view augmentation to address the three issues, respectively. Specifically, we first leverage intrinsic decomposition to decouple the shading information from the generated images to reduce the impact of inconsistent lighting; then, we introduce a mono prior with view-dependent transient encoding to enhance the reconstructed normals; and finally, we design a view augmentation fusion strategy that minimizes pixel-level loss in generated sparse views and semantic loss in augmented random views, resulting in view-consistent geometry and detailed textures. Our approach, therefore, enables the integration of a pre-trained MV image generator and a neural network-based volumetric signed distance function (SDF) representation for single-image-to-3D object reconstruction. We evaluate our framework on various datasets and demonstrate its superior performance in both quantitative and qualitative assessments. Compared with the latest state-of-the-art method SyncDreamer (Liu et al., 2023), we reduce the Chamfer Distance error by about 36% and improve PSNR by about 30%.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to reconstruct accurate 3D objects from imperfect generated 2D images. When existing multi-view 3D reconstruction methods are applied to images produced by generative models, they often run into inconsistent lighting, geometric misalignment, and sparse views, all of which degrade reconstruction quality. To address these issues, the paper proposes a new 3D reconstruction framework that uses intrinsic decomposition guidance, transient-mono prior guidance, and view augmentation to solve the three problems respectively. Specifically, the authors first reduce the impact of inconsistent lighting by decoupling the shading information from the generated images. They then introduce a monocular normal prior with view-dependent transient encoding to improve the reconstructed normals. Finally, they design a view augmentation fusion strategy that minimizes pixel-level loss in the generated sparse views and semantic loss in augmented random views, yielding view-consistent geometry and detailed textures. By combining a pre-trained multi-view image generator with a neural network-based volumetric SDF representation, the paper improves the performance of reconstructing 3D objects from single images, texts, or other input conditions. Experimental results show that the method outperforms existing techniques on multiple datasets: compared with SyncDreamer, it reduces the Chamfer Distance error by about 36% and improves Peak Signal-to-Noise Ratio (PSNR) by about 30%.

In conclusion, the main contributions of the paper are as follows:
1. Proposing a multi-view 3D reconstruction method for imperfect generated images, which can be easily applied to 3D generation based on 2D image generation.
2. Introducing intrinsic decomposition to address inconsistent lighting, improve reconstruction quality, and obtain reflectance components.
3. Improving the geometric detail and consistency of generated 3D objects through a normal prior model and view-dependent transient encoding.
4. Introducing a view augmentation scheme that generates semantic guidance from densely sampled random views to alleviate the insufficient supervision caused by sparse views.
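
The two metrics behind the reported gains, Chamfer Distance for geometry and PSNR for rendered appearance, can be illustrated with a minimal Python sketch. This is not the paper's evaluation code. It assumes one common Chamfer Distance convention (the average of the two directed mean nearest-neighbour Euclidean distances; other works use squared distances or a sum instead of an average) and assumes image intensities in the range [0, 1]. The function names `chamfer_distance` and `psnr` are hypothetical.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two non-empty 3D point sets.

    Assumed convention: average of the two directed mean
    nearest-neighbour Euclidean distances (unsquared).
    """
    def one_way(src, dst):
        # Mean distance from each point in src to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


def psnr(image_a, image_b, max_value=1.0):
    """PSNR in dB between two equally sized images given as flat lists of floats."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and have the same size")
    # Mean squared error over all pixel values.
    mse = sum((x - y) ** 2 for x, y in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)]
    print("Chamfer Distance:", chamfer_distance(reconstructed, ground_truth))
    print("PSNR (dB):", psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9]))
```

A lower Chamfer Distance means the reconstructed surface lies closer to the ground-truth surface, and a higher PSNR means the rendered views match the reference images more closely. The brute-force nearest-neighbour search is quadratic in the number of points, so a real evaluation on dense point clouds would typically use a spatial index such as a k-d tree.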