Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang
2024-04-02
Abstract: Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of creating controllable, photorealistic human avatars that are fully 3D-consistent and animatable from a single image. Specifically, the authors integrate a 3D morphable model (3DMM) into a state-of-the-art multi-view-consistent diffusion model to improve its quality and functionality on novel view synthesis from a single image. The same integration is intended to bring seamless and accurate facial expression and body pose control into the generation process.

### Main contributions:

1. **Improved image quality**: Conditioning the generation process on the 3D morphable model significantly improves the quality of generated images on most metrics.
2. **Novel facial expression generation**: A more efficient training scheme makes it possible to generate new facial expressions of unseen subjects from a single image.
3. **First animatable high-fidelity head model from a single image**: According to the authors, this is the first method to generate a highly photorealistic, animatable head model of an unseen subject from a single input image, with an unseen facial expression as the driving signal.

### Problems solved:

- **3D consistency**: Existing methods often fail to maintain 3D consistency when generating new views, especially under complex facial expression changes. The proposed model addresses this by introducing the 3D morphable model, which keeps generated images consistent across viewpoints.
- **Facial expression and body pose control**: Existing methods struggle to control facial expressions and body poses precisely during generation. This paper achieves effective control of these features by combining the 3D morphable model with the diffusion model.
- **High-quality avatars from a single image**: Traditional 3D-scanning-based methods require a large amount of visual input, such as multi-view images or monocular videos. The method in this paper generates high-quality avatars from a single image, greatly reducing the data requirements.

### Technological innovations:

- **Conditioning on the 3D morphable model**: Using the 3D morphable model as a conditional input to guide the diffusion model's generation process improves the 3D consistency and visual quality of the generated images.
- **Efficient training strategy**: A new training strategy randomly samples different facial expressions during training, which gives explicit control over the generated images and improves training efficiency.

In conclusion, the paper combines a 3D morphable model with a diffusion model to generate high-quality, controllable, photorealistic human avatars from a single image. The approach may also open new possibilities for virtual reality, digital entertainment, and related fields.
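
The training strategy summarized above (randomly sampling different facial expressions during training) could be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `TrainingPair` type, the `sample_training_pair` function, and the toy `dataset` dictionary are all hypothetical names. The sketch assumes that the expression shown in the input image and the expression used to condition the target views are drawn independently, so the two can differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPair:
    """Hypothetical record describing one training sample."""
    subject: str
    input_expression: str   # expression shown in the single input image
    target_expression: str  # expression whose 3DMM mesh conditions the target views


def sample_training_pair(expressions_by_subject, rng):
    """Pick a subject, then draw input and target expressions independently.

    Assumption: independent draws mean the target expression often differs
    from the input expression, so appearance would come from the input image
    and expression from the morphable-model conditioning.
    """
    subject = rng.choice(sorted(expressions_by_subject))
    expressions = expressions_by_subject[subject]
    input_expression = rng.choice(expressions)
    target_expression = rng.choice(expressions)
    return TrainingPair(subject, input_expression, target_expression)


# Toy stand-in for a multi-expression capture dataset (made-up labels).
dataset = {
    "subject_a": ["neutral", "smile", "surprise"],
    "subject_b": ["neutral", "frown"],
}

rng = random.Random(0)  # fixed seed so the sampling is reproducible
pairs = [sample_training_pair(dataset, rng) for _ in range(200)]
```

In a real pipeline, each sampled pair would presumably be used to load the input image, the target-view images, and the fitted morphable-model mesh for the target expression. Those steps are omitted here because the summary does not describe them.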