Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation

Qiao Yu, Xianzhi Li, Yuan Tang, Xu Han, Long Hu, Yixue Hao, Min Chen
2024-11-25
Abstract: Generating 3D meshes from a single image is an important but ill-posed task. Existing methods mainly adopt 2D multiview diffusion models to generate intermediate multiview images, and use the Large Reconstruction Model (LRM) to create the final meshes. However, the multiview images exhibit local inconsistencies, and the meshes often lack fidelity to the input image or look blurry. We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues, respectively. The appearance enhancement module deforms the 2D multiview images to realign misaligned pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive qualitative and quantitative experiments verify Fancy123's SoTA performance with significant improvement. Also, the two enhancement modules are plug-and-play and work at inference time, allowing seamless integration into various existing single-image-to-3D methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three key problems that arise when generating high-quality 3D meshes from a single image:

1. **Inconsistency between multiview images**: The multiview images generated by existing methods often contain misaligned pixels in local regions, so the views are inconsistent with one another.
2. **Low fidelity to the input image**: The generated 3D meshes often fail to match the input image, showing shape distortion or loss of detail.
3. **Color blur**: The generated 3D meshes often look blurry, mainly because of the inconsistency between multiview images and the inaccurate color mapping from images to the mesh.

To address these problems, the paper proposes a method named Fancy123, whose main components are:

- **Appearance enhancement module**: Corrects the inconsistency among the multiview images through 2D image deformation. Specifically, this module optimizes a grid-based 2D deformation field so that the deformed multiview images agree better when projected onto the initial mesh. In formula form:

  \[ I' = D_{2D}(I, F) \]

  where \(I\) is a multiview image before deformation, \(F\) is the deformation field, \(D_{2D}\) is the 2D deformation operation, and \(I'\) is the deformed image.
- **Fidelity enhancement module**: Brings the generated 3D mesh closer to the input image through 3D mesh deformation. Specifically, this module parameterizes the mesh deformation with a Jacobian field and optimizes this field to minimize the difference between the rendered mesh and the input image. In formula form:

  \[ V' = \arg\min_V \|LV - \nabla^T AJ\|^2 \]

  where \(V'\) denotes the deformed vertex positions, \(L\) is the Laplacian operator, \(\nabla\) is the gradient operator, \(A\) is the mass matrix, and \(J\) is the Jacobian field.
- **Unprojection operation**: The input image and the deformed multiview images are unprojected (back-projected) onto the initial mesh generated by the Large Reconstruction Model (LRM). These unprojected colors replace the blurry mesh colors predicted by LRM, so the final 3D mesh has sharp, accurately mapped colors.

In summary, Fancy123 improves the quality of 3D meshes generated from a single image by combining two enhancement modules (appearance enhancement and fidelity enhancement) with the unprojection operation, addressing the multiview inconsistency, low fidelity, and color blur of existing methods.
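To make the appearance enhancement formula \(I' = D_{2D}(I, F)\) more concrete, the following is a minimal NumPy sketch of one possible grid-based 2D deformation. It is an illustration under stated assumptions, not the paper's implementation: the function names `upsample_field` and `deform_2d` are hypothetical, the field \(F\) is assumed to be a coarse grid of pixel displacements `(dx, dy)`, and the warp is assumed to be a backward warp, \(I'(x, y) = I(x + dx, y + dy)\), with bilinear sampling. In the paper \(F\) is optimized; here it is simply given.

```python
import numpy as np

def upsample_field(F, height, width):
    """Bilinearly interpolate a coarse (gh, gw, 2) displacement grid to (height, width, 2)."""
    gh, gw, _ = F.shape
    ys = np.linspace(0, gh - 1, height)   # pixel rows expressed in grid coordinates
    xs = np.linspace(0, gw - 1, width)    # pixel columns expressed in grid coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, gh - 1)
    x1 = np.minimum(x0 + 1, gw - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = F[y0][:, x0] * (1 - wx) + F[y0][:, x1] * wx
    bot = F[y1][:, x0] * (1 - wx) + F[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def deform_2d(I, F):
    """Assumed form of D_2D: backward-warp image I (H, W, C) by the grid field F."""
    H, W = I.shape[:2]
    disp = upsample_field(F, H, W)        # per-pixel (dx, dy) displacement
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Source location sampled for each output pixel, clamped to the image border.
    src_x = np.clip(xx + disp[..., 0], 0, W - 1)
    src_y = np.clip(yy + disp[..., 1], 0, H - 1)
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = (src_x - x0)[..., None]
    wy = (src_y - y0)[..., None]
    # Bilinear blend of the four neighbouring source pixels.
    top = I[y0, x0] * (1 - wx) + I[y0, x1] * wx
    bot = I[y1, x0] * (1 - wx) + I[y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Example: a zero field leaves the image unchanged.
image = np.random.default_rng(0).random((8, 8, 3))
unchanged = deform_2d(image, np.zeros((3, 3, 2)))
```

A coarse grid keeps the number of optimized parameters small and makes the resulting warp smooth, which is one plausible reason to prefer it over a free per-pixel flow for correcting local misalignments.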
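The fidelity enhancement formula \(V' = \arg\min_V \|LV - \nabla^T AJ\|^2\) recovers vertex positions from a Jacobian field. The sketch below illustrates only that recovery step, under several assumptions that are not taken from the paper: the helper names `gradient_and_mass` and `solve_vertices` are hypothetical, \(\nabla\) is built as the standard piecewise-linear gradient on triangles, \(A\) holds triangle areas, \(L = \nabla^T A \nabla\), and dense matrices with `np.linalg.lstsq` stand in for a sparse solver. The rendering-based optimization of \(J\) described above is not shown; \(J\) is treated as given.

```python
import numpy as np

def gradient_and_mass(V, faces):
    """Dense gradient operator G (3T x N) and diagonal mass matrix A (3T x 3T)."""
    T, N = len(faces), len(V)
    G = np.zeros((3 * T, N))
    areas = np.zeros(T)
    for t, (i, j, k) in enumerate(faces):
        p0, p1, p2 = V[i], V[j], V[k]
        n = np.cross(p1 - p0, p2 - p0)
        dbl_area = np.linalg.norm(n)      # twice the triangle area
        n = n / dbl_area                  # unit normal
        areas[t] = 0.5 * dbl_area
        # Gradient of each corner's hat function: (n x opposite edge) / (2 * area).
        for v, edge in ((i, p2 - p1), (j, p0 - p2), (k, p1 - p0)):
            G[3 * t:3 * t + 3, v] = np.cross(n, edge) / dbl_area
    A = np.diag(np.repeat(areas, 3))
    return G, A

def solve_vertices(V_rest, faces, J):
    """Least-squares solution of L V = G^T A J, where J stacks one 3x3 block per triangle."""
    G, A = gradient_and_mass(V_rest, faces)
    L = G.T @ A @ G                       # Laplacian assembled from gradient and mass
    rhs = G.T @ A @ J
    # L is singular (constant offsets lie in its null space); lstsq returns the
    # minimum-norm minimizer, i.e. the solution whose centroid is at the origin.
    V_new, *_ = np.linalg.lstsq(L, rhs, rcond=1e-10)
    return V_new

# Example: a tetrahedron whose target Jacobians come from a known affine map.
V_rest = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
M = np.array([[1.2, 0.1, 0.0], [0.0, 0.9, 0.2], [0.3, 0.0, 1.1]])
V_target = V_rest @ M.T + np.array([0.5, -1.0, 2.0])
G, _ = gradient_and_mass(V_rest, faces)
V_solved = solve_vertices(V_rest, faces, G @ V_target)
```

Because the Jacobian field carries no information about translation, the solve determines the mesh only up to a global offset. In this sketch the minimum-norm solution fixes that offset by centering the result, whereas a practical pipeline might instead pin one vertex.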