CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models

Felix Taubner, Ruihang Zhang, Mathieu Tuli, David B. Lindell
2024-12-17
Abstract: Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints: for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but visual fidelity yet lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is: **reconstructing realistic, dynamic 4D (dynamic 3D) portrait avatars from images so that they can be animated and rendered in real time**. Specifically, the authors focus on using any number of reference images (from one to 100) to build high-quality 4D avatars that preserve the identity of the person in the input images and animate naturally across viewpoints and expressions.

### Problem Background

1. **Multi-view Reconstruction vs. Single-view Reconstruction**:
   - Techniques based on multi-view stereo or neural rendering produce the highest-quality 4D avatars, but they require hundreds of reference images.
   - Generative models that work from a single reference image produce convincing avatars, but still lag behind multi-view methods in visual fidelity.
2. **Limitations of Existing Methods**:
   - Existing 4D avatar reconstruction methods do not extend seamlessly across the full range from one to 100 reference images while keeping quality consistently high.
   - Many diffusion-based methods are computationally expensive, which makes real-time rendering and animation difficult.

### Solution

The paper proposes a new method named CAP4D, which uses a morphable multi-view diffusion model (MMDM) to reconstruct 4D avatars. The steps are as follows:

1. **Generate Multi-view Images**:
   - Use the MMDM to generate a large number of novel-view images from the input reference images, covering different expressions and viewing angles.
   - Keep the generated images consistent and diverse through a stochastic input/output (I/O) conditioning process.
2. **Construct 4D Avatars**:
   - Use the generated images together with the reference images to train a 4D avatar model based on a 3D Gaussian splatting representation.
   - This model can be animated and rendered in real time and captures subtle changes in expression.

### Main Contributions

1. **Propose the MMDM**: A model for multi-view portrait image generation, paired with a stochastic I/O conditioning process that produces self-consistent portrait images.
2. **Develop Real-time 4D Avatars**: Convert the generated portrait images into a 4D avatar that can be animated and rendered in real time.
3. **Extensive Evaluation**: Evaluate the method on self-reenactment and cross-identity reenactment tasks, demonstrating state-of-the-art performance across input conditions.

Through these contributions, CAP4D achieves state-of-the-art 4D portrait avatar reconstruction for single-image, few-image, and multi-image inputs, improving both visual quality and identity fidelity.
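
As a rough illustration of the I/O conditioning step described under Solution, the sketch below plans which reference images (inputs) and which generated views (outputs) share each model call. This is a minimal sketch under assumptions, not the authors' implementation: the summary does not spell out the procedure, so the partitioning scheme, the function name `plan_io_conditioning`, and the parameters `max_refs` and `max_targets` are all hypothetical. It assumes the model can attend to only a few images per call, so every denoising step reshuffles the target views into small batches and pairs each batch with a freshly drawn subset of references.

```python
import random


def plan_io_conditioning(reference_ids, target_ids, num_steps,
                         max_refs=4, max_targets=8, seed=0):
    """Plan (references, targets) batches for each denoising step.

    Hypothetical illustration: image IDs stand in for real images, and
    no diffusion model is called. Returns a list with one entry per
    step; each entry is a list of (refs, targets) tuples.
    """
    rng = random.Random(seed)  # seeded for reproducible plans
    schedule = []
    for _ in range(num_steps):
        # Reshuffle targets so batch membership changes every step.
        shuffled = list(target_ids)
        rng.shuffle(shuffled)

        batches = []
        for start in range(0, len(shuffled), max_targets):
            targets = shuffled[start:start + max_targets]
            # Draw a new random subset of references for this batch.
            n_refs = min(max_refs, len(reference_ids))
            refs = rng.sample(list(reference_ids), n_refs)
            batches.append((refs, targets))
        schedule.append(batches)
    return schedule


if __name__ == "__main__":
    plan = plan_io_conditioning(
        reference_ids=list(range(10)),        # 10 reference images
        target_ids=list(range(100, 123)),     # 23 views to generate
        num_steps=3,
    )
    for step, batches in enumerate(plan):
        for refs, targets in batches:
            print(f"step {step}: condition on {refs} -> denoise {targets}")
```

With 10 references and 23 target views, each step yields three batches, and every target view is visited exactly once per step while the references it is conditioned on change from step to step. The intuition, under the assumptions above, is that mixing inputs and outputs randomly lets a model with a small per-call capacity still couple a large set of generated images.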