Abstract: Digital humans and, especially, 3D facial avatars have attracted a lot of attention in recent years, as they are the backbone of several applications like immersive telepresence in AR or VR. Despite the progress, facial avatars reconstructed from commodity hardware are incomplete and miss parts of the side and back of the head, severely limiting the usability of the avatar. This limitation in prior work stems from the requirement of face tracking, which fails for profile and back views. To address this issue, we propose to learn person-specific animatable avatars from images without assuming access to precise facial expression tracking. At the core of our method, we leverage a 3D-aware generative model that is trained to reproduce the distribution of facial expressions from the training data. To train this appearance model, we assume only a collection of 2D images with the corresponding camera parameters. For controlling the model, we learn a mapping from 3DMM facial expression parameters to the latent space of the generative model. This mapping can be learned by sampling the latent space of the appearance model and reconstructing the facial parameters from a normalized frontal view, where facial expression estimation performs well. With this scheme, we decouple 3D appearance reconstruction and animation control to achieve high fidelity in image synthesis. In a series of experiments, we compare our proposed technique to state-of-the-art monocular methods and show superior quality while not requiring expression tracking of the training data.

What problem does this paper attempt to address?

The problem this paper addresses is reconstructing personalized, animatable 3D human head avatars from image data without relying on precise facial expression tracking. Specifically, the paper proposes a method based on a generative adversarial network (GAN) that can be trained using only a set of 2D images and their corresponding camera parameters, without per-frame geometry or precise tracking of facial expressions. Previous methods usually require precise facial expression tracking, which tends to fail on side and back views and leaves the reconstructed avatar incomplete in those regions of the head.

### Main contributions of the paper:

1. **A 3D-aware personalized avatar appearance model**: This model can be trained without precise facial expression tracking. A pre-trained EG3D model is fine-tuned to match the distribution of the input data, which yields a personalized 3D appearance model.
2. **Expression mapping network**: This network maps standard BFM (Basel Face Model) expression parameters to the latent space of the generative model, so that new animations can be generated by controlling these expression parameters.

### Key problems solved:

- **Avoid dependence on facial expression tracking**: Traditional 3D avatar reconstruction methods rely on precise facial expression tracking, which is prone to failure on side and back views. This method removes that dependence by combining the generative model with the expression mapping network, and can therefore reconstruct the full head more completely.
- **Improve reconstruction quality**: By decoupling 3D appearance reconstruction from animation control, the method achieves high fidelity in image synthesis and performs especially well on detailed regions such as teeth and hair.
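
The expression mapping idea can be illustrated with a toy sketch: sample latent codes, obtain expression parameters for each sample, and fit a mapping from expressions back to latents. Everything below is an assumption made for illustration and not the authors' implementation. The function `render_and_track` is a hypothetical fixed linear stand-in for "render a frontal view with the generator and estimate 3DMM expression parameters", the dimensions are tiny, and a linear map trained by plain gradient descent replaces the mapping network.

```python
import random

LATENT_DIM = 4  # hypothetical; a real generator latent space is much larger
EXPR_DIM = 2    # hypothetical; stands in for 3DMM expression parameters

# Hypothetical fixed weights for the toy "render frontal view + track" step.
TRACK = [[0.8, -0.3, 0.1, 0.5],
         [0.2, 0.6, -0.7, 0.1]]


def render_and_track(latent):
    """Toy stand-in: latent code -> expression parameters."""
    return [sum(a * w for a, w in zip(row, latent)) for row in TRACK]


def sample_pairs(n, rng):
    """Sample latent codes and pair each with its estimated expression."""
    pairs = []
    for _ in range(n):
        latent = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
        pairs.append((render_and_track(latent), latent))
    return pairs


def apply_mapping(weights, expr):
    """Linear mapping from expression parameters to a latent code."""
    return [sum(m * e for m, e in zip(row, expr)) for row in weights]


def train_mapping(pairs, steps=200, lr=0.5):
    """Fit the mapping with full-batch gradient descent on squared error."""
    weights = [[0.0] * EXPR_DIM for _ in range(LATENT_DIM)]
    n = len(pairs)
    for _ in range(steps):
        grad = [[0.0] * EXPR_DIM for _ in range(LATENT_DIM)]
        for expr, latent in pairs:
            pred = apply_mapping(weights, expr)
            for i in range(LATENT_DIM):
                err = pred[i] - latent[i]
                for j in range(EXPR_DIM):
                    grad[i][j] += err * expr[j] / n
        for i in range(LATENT_DIM):
            for j in range(EXPR_DIM):
                weights[i][j] -= lr * grad[i][j]
    return weights


rng = random.Random(0)
pairs = sample_pairs(200, rng)
mapping = train_mapping(pairs)

# Drive the toy model with a new expression and check the round trip.
target_expr = [0.3, -0.2]
latent = apply_mapping(mapping, target_expr)
recovered = render_and_track(latent)
```

In this toy setting the recovered expression matches the target closely, because the least-squares mapping inverts the linear tracker on the expression subspace. The point of the sketch is only the data flow: no expression tracking of the training images is needed, since the pairs come from samples of the appearance model itself.
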
### Method overview:

1. **3D-consistent appearance model**: Building on the EG3D model, the method obtains a 3D-consistent appearance model that generates tri-plane features from random latent codes and produces images through volume rendering.
2. **Expression mapping network**: Paired data (i.e., pairs of latent codes and expression parameters) are generated and used to train a mapping network from expression parameters to the latent space of the generative model, which gives control over the generative model.

### Experimental results:

- **Quantitative evaluation**: On multiple metrics (such as MSE, PSNR, SSIM, LPIPS), the method performs better than existing monocular avatar reconstruction methods.
- **Qualitative evaluation**: The generated 3D avatars not only look good in the front view but also produce high-quality side and back views, demonstrating strong performance across the full range of viewpoints.

In conclusion, the paper proposes a method that addresses the limitations of existing 3D avatar reconstruction methods on side and back views, and it does so without the need for precise facial expression tracking.
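
For readers unfamiliar with the pixel-level metrics named in the quantitative evaluation, here is a minimal sketch of MSE and PSNR. It assumes images are given as flat lists of intensities in the range [0, 1]; the helper names `mse` and `psnr` are chosen for illustration, and the paper's exact evaluation protocol is not given in this summary. SSIM and LPIPS are omitted because they need windowed statistics and a learned network, respectively.

```python
import math


def mse(image_a, image_b):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(image_a) != len(image_b) or not image_a:
        raise ValueError("images must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)


def psnr(image_a, image_b, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer images."""
    error = mse(image_a, image_b)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


reference = [0.5, 0.5, 0.5, 0.5]
rendered = [0.6, 0.4, 0.6, 0.4]
score = psnr(reference, rendered)  # about 20 dB for a uniform error of 0.1
```

Lower MSE and higher PSNR both indicate a rendering that is closer to the ground-truth image at the pixel level.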

GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar

GANHead: Towards Generative Animatable Neural Head Avatars

Animated 3D Human Avatars from a Single Image with GAN-based Texture Inference.

GPAvatar: Generalizable and Precise Head Avatar from Image(s)

Generalizable and Animatable Gaussian Head Avatar

Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos

TimeWalker: Personalized Neural Space for Lifelong Head Avatars

GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

One2Avatar: Generative Implicit Head Avatar For Few-shot User Adaptation

HQ3DAvatar: High Quality Controllable 3D Head Avatar

Neural Head Avatars from Monocular RGB Videos

GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image

MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar

Gaze Generation for Avatars Using GANs

AniArtAvatar: Animatable 3D Art Avatar from a Single Image

HQ3DAvatar: High Quality Implicit 3D Head Avatar

XAGen: 3D Expressive Human Avatars Generation

AG3D: Learning to Generate 3D Avatars from 2D Image Collections

NECA: Neural Customizable Human Avatar

Towards Native Generative Model for 3D Head Avatar

FAGhead: Fully Animate Gaussian Head from Monocular Videos