GPAvatar: Generalizable and Precise Head Avatar from Image(s)

Xuangeng Chu, Yu Li, Ailing Zeng, Tianyu Yang, Lijian Lin, Yunfei Liu, Tatsuya Harada
2024-01-19
Abstract: Head avatar reconstruction, crucial for applications in virtual reality, online meetings, gaming, and film industries, has garnered substantial attention within the computer vision community. The fundamental objective of this field is to faithfully recreate the head avatar and precisely control expressions and postures. Existing methods, categorized into 2D-based warping, mesh-based, and neural rendering approaches, present challenges in maintaining multi-view consistency, incorporating non-facial information, and generalizing to new identities. In this paper, we propose a framework named GPAvatar that reconstructs 3D head avatars from one or several images in a single forward pass. The key idea of this work is to introduce a dynamic point-based expression field driven by a point cloud to precisely and effectively capture expressions. Furthermore, we use a Multi Tri-planes Attention (MTA) fusion module in the tri-planes canonical field to leverage information from multiple input images. The proposed method achieves faithful identity reconstruction, precise expression control, and multi-view consistency, demonstrating promising results for free-viewpoint rendering and novel view synthesis.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is how to reconstruct high-quality 3D head avatars from one or more images while precisely controlling expressions and postures. Specifically, the authors target three weaknesses of current methods: multi-view consistency, modeling of non-facial information, and generalization to new identities. They propose a framework named GPAvatar, which aims to achieve faithful identity reconstruction, precise expression control, and multi-view consistency in a single forward pass by introducing a Point-based Expression Field (PEF) and a Multi Tri-planes Attention (MTA) module.

### Core problems of the paper

1. **Multi-view consistency**: Existing 2D warping methods have difficulty maintaining multi-view consistency under large changes in head posture.
2. **Non-facial information fusion**: Mesh-based methods handle multi-view consistency relatively well, but they model non-facial information such as hair poorly.
3. **New-identity generalization**: NeRF-based methods perform well in multi-view synthesis, but they generalize poorly to new identities and require a large amount of data for optimization.

### Main contributions of GPAvatar

1. **Single-forward-pass reconstruction**: High-quality 3D avatar reconstruction is completed in one forward pass.
2. **Point-based Expression Field (PEF)**: A dynamic expression field driven by a point cloud provides natural and precise cross-identity expression control.
3. **Multi Tri-planes Attention (MTA) module**: The module accepts a single image or multiple images and fuses their information through attention over tri-planes, which helps especially in extreme cases such as closed eyes or occlusion.
### Experimental verification

The paper evaluates GPAvatar through experiments on the VFHQ and HDTF datasets. The reported results show that GPAvatar performs well on the same-identity reconstruction task and also demonstrates strong generalization and precise expression control on the cross-identity reconstruction task.

### Formula representation

The formulas involved in the paper include:

1. **Expression feature calculation**:

   \[ f_{\text{exp}, x} = \sum_{i = 1}^{K} \frac{w_i}{\sum_{j = 1}^{K} w_j} L_p\big(f_i, F_{\text{pos}}(p_i - x)\big), \quad \text{where} \quad w_i = \frac{1}{\|p_i - x\|} \]

   Here, \( x \) is the query position, \( p_i \) and \( f_i \) are the position and feature of the \( i \)-th of the \( K \) neighbouring points, \( L_p \) is a linear layer, and \( F_{\text{pos}} \) is a position-encoding function.

2. **Multi Tri-planes Attention module**:

   \[ P = \sum_{i = 1}^{N} \frac{w_i}{\sum_{j = 1}^{N} w_j} E(I_i), \quad \text{where} \quad w_i = L_q(Q) L_k(E(I_i)) \]

   Here, \( I_i \) is an input image, \( N \) is the number of input images, \( E \) is a standard encoder, \( L_q \) and \( L_k \) are linear layers that generate queries and keys, and \( Q \) is a learnable query tri-plane.

In both formulas the weights are normalized to sum to one, so each result is a weighted average: of neighbouring point features for the expression field, and of per-image tri-planes for multi-input fusion.
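
The two weighted sums above can be sketched in NumPy, reading the weights as normalized (each \( w_i \) divided by the sum of all \( w_j \)). This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, the sinusoidal form of \( F_{\text{pos}} \), the selection of the \( K \) nearest points, and the flattening of each tri-plane into `M` positions are all hypothetical choices made here for clarity.

```python
import numpy as np


def positional_encoding(offsets, num_freqs=4):
    """Sinusoidal encoding (an assumed form of F_pos): (K, 3) -> (K, 3 * 2 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs)                  # (F,)
    angles = offsets[:, :, None] * freqs                 # (K, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(offsets.shape[0], -1)


def expression_feature(x, points, feats, W, b, k=8, eps=1e-8):
    """Inverse-distance-weighted expression feature at query position x.

    points: (P, 3) point positions p_i; feats: (P, C) point features f_i.
    W, b: weights and bias of the linear layer L_p (hypothetical shapes).
    """
    dist = np.linalg.norm(points - x, axis=1)            # ||p_i - x||
    idx = np.argsort(dist)[:k]                           # assumed: K nearest points
    offsets = points[idx] - x                            # p_i - x
    w = 1.0 / (dist[idx] + eps)                          # w_i; eps avoids division by zero
    w = w / w.sum()                                      # w_i / sum_j w_j
    inputs = np.concatenate([feats[idx], positional_encoding(offsets)], axis=1)
    per_point = inputs @ W + b                           # L_p(f_i, F_pos(p_i - x))
    return w @ per_point                                 # weighted average, shape (C_out,)


def fuse_triplanes(planes, Q, Wq, Wk):
    """Attention-weighted average of per-image tri-plane features.

    planes: (N, M, C) encoder outputs E(I_i), flattened over M positions.
    Q: (M, C) learnable query; Wq, Wk: (C, D) matrices standing in for L_q, L_k.
    """
    q = Q @ Wq                                           # (M, D)
    keys = planes @ Wk                                   # (N, M, D)
    w = np.sum(q[None] * keys, axis=-1)                  # (N, M): w_i = L_q(Q) . L_k(E(I_i))
    # Plain normalization as in the formula; it assumes the scores at each
    # position do not sum to zero (a softmax would be a common safeguard,
    # but that is not what the formula states).
    w = w / w.sum(axis=0, keepdims=True)
    return np.sum(w[..., None] * planes, axis=0)         # (M, C)
```

With `points` of shape `(P, 3)` and `feats` of shape `(P, C)`, `W` must have `C + 24` rows for the default `num_freqs=4`. Because the weights are normalized, passing `N` identical tri-planes to `fuse_triplanes` returns that same tri-plane, which is a convenient sanity check on the normalization.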