Abstract: Head avatar reconstruction, crucial for applications in virtual reality, online meetings, gaming, and film industries, has garnered substantial attention within the computer vision community. The fundamental objective of this field is to faithfully recreate the head avatar and precisely control expressions and postures. Existing methods, categorized into 2D-based warping, mesh-based, and neural rendering approaches, present challenges in maintaining multi-view consistency, incorporating non-facial information, and generalizing to new identities. In this paper, we propose a framework named GPAvatar that reconstructs 3D head avatars from one or several images in a single forward pass. The key idea of this work is to introduce a dynamic point-based expression field driven by a point cloud to precisely and effectively capture expressions. Furthermore, we use a Multi Tri-planes Attention (MTA) fusion module in the tri-planes canonical field to leverage information from multiple input images. The proposed method achieves faithful identity reconstruction, precise expression control, and multi-view consistency, demonstrating promising results for free-viewpoint rendering and novel view synthesis.

What problem does this paper attempt to address?

This paper addresses how to reconstruct a high-quality 3D head avatar from one or more images while retaining precise control over expressions and postures. Specifically, the authors target three weaknesses of current methods: multi-view consistency, modeling of non-facial information, and generalization to new identities. They propose a framework named GPAvatar that aims to achieve faithful identity reconstruction, precise expression control, and multi-view consistency in a single forward pass, by introducing a Point-based Expression Field (PEF) and a Multi Tri-planes Attention (MTA) module.

### Core problems of the paper

1. **Multi-view consistency**: Existing 2D warping methods struggle to maintain multi-view consistency under large changes in head pose.
2. **Non-facial information fusion**: Mesh-based methods handle multi-view consistency relatively well, but they model non-facial information such as hair poorly.
3. **New-identity generalization**: NeRF-based methods perform well in multi-view synthesis, but they generalize poorly to new identities and require a large amount of data for optimization.

### Main contributions of GPAvatar

1. **Reconstruction in a single forward pass**: A high-quality 3D head avatar is reconstructed in one forward pass, without per-identity optimization.
2. **Point-based Expression Field (PEF)**: A dynamic expression field driven by a point cloud provides natural and precise cross-identity expression control.
3. **Multi Tri-planes Attention (MTA) module**: The module accepts one or several input images and fuses their information through attention over tri-planes, which helps most in extreme cases such as closed eyes or occlusion.
### Experimental verification

The paper verifies GPAvatar through experiments on the VFHQ and HDTF datasets. The results show that GPAvatar performs well on the same-identity reconstruction task and also demonstrates strong generalization and precise expression control on the cross-identity reconstruction task.

### Formula representation

The formulas involved in the paper include:

1. **Expression feature calculation**:

\[ f_{\text{exp}, x} = \sum_{i = 1}^{K} \frac{w_i}{\sum_{j = 1}^{K} w_j} L_p(f_i, F_{\text{pos}}(p_i - x)), \quad \text{where} \quad w_i = \frac{1}{\|p_i - x\|} \]

Here, \( x \) is the query position, \( p_i \) and \( f_i \) are the position and feature of the \( i \)-th of the \( K \) points used for that query, \( L_p \) is a linear layer, and \( F_{\text{pos}} \) is a position-encoding function.

2. **Multi-tri-plane attention module**:

\[ P = \sum_{i = 1}^{N} \frac{w_i}{\sum_{j = 1}^{N} w_j} E(I_i), \quad \text{where} \quad w_i = L_q(Q) L_k(E(I_i)) \]

Here, \( I_i \) is an input image, \( N \) is the number of input images, \( E \) is a standard encoder, \( L_q \) and \( L_k \) are linear layers that generate queries and keys, and \( Q \) is a learnable query tri-plane.

Both formulas are normalized weighted sums: the first blends the features of nearby points by inverse distance, and the second blends the encoded input images by attention weight.
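
The two weighted sums above can be illustrated with a small, standard-library Python sketch. It is not the authors' implementation. It assumes the weights are normalized to sum to one, that the \( K \) points are the nearest neighbours of the query position, and that attention scores are normalized with a softmax (a substitution made here for numerical stability). The names `knn_expression_feature`, `fuse_planes`, and the `layer` callable are hypothetical; `layer` stands in for \( L_p(f_i, F_{\text{pos}}(\cdot)) \), and the query and key projections \( L_q \), \( L_k \) are taken to be the identity.

```python
import math


def knn_expression_feature(x, points, feats, k, layer):
    """Blend the features of the k nearest points with inverse-distance weights.

    `layer(f_i, offset)` is a hypothetical stand-in for L_p(f_i, F_pos(p_i - x)).
    """
    # Distance from the query position x to every point p_i.
    dists = [math.dist(p, x) for p in points]
    # Indices of the k nearest points (assumption: a K-nearest-neighbour search).
    nearest = sorted(range(len(points)), key=lambda i: dists[i])[:k]
    eps = 1e-8  # avoids division by zero when x coincides with a point
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    out = None
    for w, i in zip(weights, nearest):
        offset = [pc - xc for pc, xc in zip(points[i], x)]
        h = layer(feats[i], offset)
        if out is None:
            out = [0.0] * len(h)
        # Accumulate the normalized contribution w_i / sum_j w_j of point i.
        out = [o + (w / total) * hc for o, hc in zip(out, h)]
    return out


def fuse_planes(query, planes):
    """Fuse N encoded planes location by location with attention weights.

    `query` and each plane are lists of feature vectors, one per location.
    Scores are dot products; softmax normalization is an assumption.
    """
    fused = []
    for loc, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, plane[loc])) for plane in planes]
        top = max(scores)  # subtract the maximum for a stable softmax
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        dim = len(planes[0][loc])
        fused.append([
            sum((e / norm) * plane[loc][d] for e, plane in zip(exps, planes))
            for d in range(dim)
        ])
    return fused
```

With a single input image, `fuse_planes` returns that image's plane unchanged, which matches the claim that the module accepts one or several inputs. Passing `layer=lambda f, offset: f` reduces `knn_expression_feature` to plain inverse-distance interpolation of the point features.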

GPAvatar: Generalizable and Precise Head Avatar from Image(s)

Generalizable and Animatable Gaussian Head Avatar

MonoGaussianAvatar: Monocular Gaussian Point-based Head Avatar

GAN-Avatar: Controllable Personalized GAN-based Human Head Avatar

GFAvatar: A High-Quality Facial Avatar Reconstruction Method

GPHM: Gaussian Parametric Head Model for Monocular Head Avatar Reconstruction

HeadGAP: Few-shot 3D Head Avatar via Generalizable Gaussian Priors

FAGhead: Fully Animate Gaussian Head from Monocular Videos

GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image

HQ3DAvatar: High Quality Implicit 3D Head Avatar

HQ3DAvatar: High Quality Controllable 3D Head Avatar

AvatarWild: Fully Controllable Head Avatars in the Wild

3D Gaussian Parametric Head Model

GGAvatar: Geometric Adjustment of Gaussian Head Avatar

TimeWalker: Personalized Neural Space for Lifelong Head Avatars

PSAvatar: A Point-based Shape Model for Real-Time Head Avatar Animation with 3D Gaussian Splatting

Universal Facial Encoding of Codec Avatars from VR Headsets

GaussianHead: High-fidelity Head Avatars with Learnable Gaussian Derivation

Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos