Abstract: Reconstructing photorealistic and dynamic portrait avatars from images is essential to many applications including advertising, visual effects, and virtual reality. Depending on the application, avatar reconstruction involves different capture setups and constraints: for example, visual effects studios use camera arrays to capture hundreds of reference images, while content creators may seek to animate a single portrait image downloaded from the internet. As such, there is a large and heterogeneous ecosystem of methods for avatar reconstruction. Techniques based on multi-view stereo or neural rendering achieve the highest quality results, but require hundreds of reference images. Recent generative models produce convincing avatars from a single reference image, but their visual fidelity still lags behind multi-view techniques. Here, we present CAP4D: an approach that uses a morphable multi-view diffusion model to reconstruct photoreal 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time. Our approach demonstrates state-of-the-art performance for single-, few-, and multi-image 4D portrait avatar reconstruction, and takes steps to bridge the gap in visual fidelity between single-image and multi-view reconstruction techniques.

What problem does this paper attempt to address?

The paper addresses the following problem: **reconstructing photorealistic, dynamic 4D (dynamic 3D) portrait avatars from images so that they can be animated and rendered in real time**. Specifically, the authors focus on using any number of reference images (from one to 100) to build high-quality 4D avatars that preserve the identity of the person in the input images and animate naturally under different viewpoints and expressions.

### Problem Background

1. **Multi-view Reconstruction vs. Single-view Reconstruction**:
   - Multi-view stereo and multi-view neural rendering techniques produce the highest-quality 4D avatars, but they require hundreds of reference images.
   - Generative models driven by a single reference image produce convincing avatars, but still lag behind multi-view methods in visual fidelity.
2. **Limitations of Existing Methods**:
   - Existing 4D avatar reconstruction methods do not scale seamlessly from a single reference image to many reference images while maintaining consistently high quality.
   - Many diffusion-model-based methods are computationally expensive, which makes real-time rendering and animation difficult.

### Solution

The paper proposes a new method named CAP4D, which uses a morphable multi-view diffusion model (MMDM) to reconstruct 4D avatars. The specific steps are as follows:

1. **Generate Multi-view Images**:
   - Use the MMDM to generate a large number of novel-view images from the input reference images, covering different expressions and viewing angles.
   - Keep the generated images consistent and diverse through a stochastic input/output (I/O) conditioning process.
2. **Construct 4D Avatars**:
   - Use the generated images together with the reference images to fit a 4D avatar model based on a 3D Gaussian splatting representation.
   - This model can be animated and rendered in real time and can capture subtle expression changes.

### Main Contributions

1. **Propose MMDM**: A model for multi-view portrait image generation, with a stochastic I/O conditioning process that yields self-consistent portrait images.
2. **Develop Real-time 4D Avatar Technology**: Distill the generated portrait images into a 4D avatar that can be animated and rendered in real time.
3. **Extensive Evaluation**: Evaluate self-reenactment and cross-reenactment tasks extensively, demonstrating state-of-the-art performance under various input conditions.

Through these contributions, CAP4D achieves state-of-the-art 4D avatar reconstruction for single-image, few-image, and multi-image inputs, improving both visual quality and identity fidelity.
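
The summary names a random (stochastic) I/O conditioning step in the image-generation stage but does not describe it. As an illustration only, the sketch below shows one plausible reading: at each denoising step, draw a random subset of reference images to condition on and regroup the images being generated into random fixed-size batches, so that a fixed-capacity model can cover any number of inputs and outputs. The function `sample_io_conditioning`, its parameters, and the batch sizes are hypothetical and are not taken from the paper or its code.

```python
import random


def sample_io_conditioning(num_refs, num_targets, refs_per_step, targets_per_step, rng):
    """Choose which images one denoising step sees (hypothetical helper).

    Returns (ref_ids, target_batches): a random subset of reference indices
    and a random partition of target indices into fixed-size groups.
    """
    # Condition on at most refs_per_step references, drawn without replacement.
    ref_ids = sorted(rng.sample(range(num_refs), min(refs_per_step, num_refs)))
    # Shuffle the targets so that each step groups them differently.
    order = list(range(num_targets))
    rng.shuffle(order)
    target_batches = [
        order[i:i + targets_per_step]
        for i in range(0, num_targets, targets_per_step)
    ]
    return ref_ids, target_batches


# Assumed sizes for illustration: 10 references, 8 images to generate.
rng = random.Random(0)
for step in range(3):
    refs, batches = sample_io_conditioning(
        num_refs=10, num_targets=8, refs_per_step=4, targets_per_step=4, rng=rng
    )
    print(step, refs, batches)
```

Because the grouping changes from step to step, every generated image is eventually denoised alongside many different references and neighbours, which is one way such a scheme could encourage the self-consistency the summary describes.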

CAP4D: Creating Animatable 4D Portrait Avatars with Morphable Multi-View Diffusion Models

Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

AnimateMe: 4D Facial Expressions via Diffusion Models

FitMe: Deep Photorealistic 3D Morphable Model Avatars

DynamicAvatars: Accurate Dynamic Facial Avatars Reconstruction and Precise Editing with Diffusion Models

Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer

Real-time Facial Animation with Image-Based Dynamic Avatars.

HQ3DAvatar: High Quality Implicit 3D Head Avatar

AniArtAvatar: Animatable 3D Art Avatar from a Single Image

Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models

RodinHD: High-Fidelity 3D Avatar Generation with Diffusion Models

HQ3DAvatar: High Quality Controllable 3D Head Avatar

3Dtoonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images

RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction

Deformable 3D Gaussian Splatting for Animatable Human Avatars

A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation

Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data

DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars

GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

Animated 3D Human Avatars from a Single Image with GAN-based Texture Inference.