One2Avatar: Generative Implicit Head Avatar For Few-shot User Adaptation

Zhixuan Yu, Ziqian Bai, Abhimitra Meka, Feitong Tan, Qiangeng Xu, Rohit Pandey, Sean Fanello, Hyun Soo Park, Yinda Zhang
2024-02-19
Abstract: Traditional methods for constructing high-quality, personalized head avatars from monocular videos demand extensive face captures and training time, posing a significant challenge for scalability. This paper introduces a novel approach to create high quality head avatar utilizing only a single or a few images per user. We learn a generative model for 3D animatable photo-realistic head avatar from a multi-view dataset of expressions from 2407 subjects, and leverage it as a prior for creating personalized avatar from few-shot images. Different from previous 3D-aware face generative models, our prior is built with a 3DMM-anchored neural radiance field backbone, which we show to be more effective for avatar creation through auto-decoding based on few-shot inputs. We also handle unstable 3DMM fitting by jointly optimizing the 3DMM fitting and camera calibration that leads to better few-shot adaptation. Our method demonstrates compelling results and outperforms existing state-of-the-art methods for few-shot avatar adaptation, paving the way for more efficient and personalized avatar creation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper aims to generate high-quality, animatable, photo-realistic head avatars from a single image or a small number of images. Traditional methods need large amounts of facial capture data and long training times to build a high-quality personalized avatar, which severely limits their scalability. This paper proposes a method that can generate a high-quality 3D avatar from only one or a few images of the target person. Specifically, it addresses the shortcomings of existing methods in four ways:

1. **Reduced data requirements**: Use a generative model learned from a large-scale multi-view, multi-expression dataset as a prior, reducing the amount of user-specific data needed.
2. **High-quality generation**: Generate high-fidelity 3D avatars by combining a 3D Morphable Model (3DMM) with a Neural Radiance Field (NeRF).
3. **Improved stability**: Stabilize few-shot adaptation by jointly optimizing 3DMM fitting, camera calibration and model weights, reducing animation artifacts and identity shift.
4. **Enhanced generalization ability**: Improve generalization to extreme views and unseen expressions by learning from multi-view, multi-expression data.

### Method overview

1. **Multi-view and multi-expression face capture**:
   - Capture high-resolution facial images of 2407 subjects under 13 predefined expressions, with each subject photographed from 13 sparse camera views for each expression.
   - Reconstruct the 3D geometry using a 3DMM fitting algorithm based on facial feature points (landmarks).
2. **Generate avatar prior**:
   - Represent the avatar using a 3DMM-anchored Neural Radiance Field (NeRF), with local features attached to the vertices of the 3DMM mesh.
   - The identity branch uses StyleGAN2 to generate identity feature maps, and the expression branch uses a U-Net to generate expression feature maps.
   - Sample the two feature maps at the 3DMM vertices through their texture coordinates to establish the 3DMM-anchored Neural Radiance Field.
3. **Few-shot adaptation**:
   - Initialize the target latent code as the average latent code of the training subjects and jointly optimize it with the model weights.
   - Adopt the PTI (Pivotal Tuning Inversion) strategy, alternating between inversion and fine-tuning of the model.
   - Improve the stability and performance of few-shot adaptation by jointly optimizing camera pose, 3DMM expression parameters and model weights.
4. **Training scheme**:
   - Train on the multi-view dataset, with a loss function that includes the photometric loss between the rendered image and the real image.
   - Adopt an auto-decoder training strategy, with each identity having a 512-dimensional latent code.
   - Randomly sample pixels for training, with a batch size of 131072, and optimize using the Adam optimizer.

### Experimental results

1. **Evaluation datasets, metrics and baselines**:
   - Dataset: Monocular selfie videos of 6 subjects from the MonoAvatar dataset, each with two sessions used for training and testing.
   - Metrics: LPIPS, PSNR and SSIM for quantitative evaluation.
   - Baselines: MonoAvatar, Next3D, Ours-TP, Ours-FFHQ and Ours-SV.
2. **Few-shot adaptation comparison**:
   - The method clearly outperforms Next3D and MonoAvatar in few-shot settings (e.g., a single image).
   - Even when using 100% of the data, the method still performs best, indicating that the generative prior provides a good initialization of the network weights.
   - Qualitative results show that the avatars generated by the method have more accurate expressions, more consistent identities and fewer artifacts.
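
Of the metrics listed above, PSNR is the simplest to state in code. The following is a minimal sketch, assuming pixel values are flattened into plain Python sequences and normalized to the range [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative choices, not taken from the paper's evaluation code.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumes both inputs hold pixel intensities in [0, max_value].
    """
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the rendered image is closer to the reference; a uniform error of 0.1 on a [0, 1] scale, for example, corresponds to about 20 dB. LPIPS and SSIM capture perceptual and structural similarity and need more machinery than fits in a short sketch.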
### Conclusion

By learning a generative prior from multi-view, multi-expression data, the proposed method can generate high-quality, animatable avatars from a single image or a small number of images. Experimental results show that it performs well in few-shot adaptation and generalizes stably to new views and expressions.
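
The inversion-then-fine-tuning idea behind the PTI-style adaptation in the method overview can be illustrated with a toy example. This is only a sketch under strong assumptions: a small linear map stands in for the avatar generator, the latent code has two dimensions rather than 512, a simple two-stage schedule replaces the paper's alternating scheme, and camera pose and 3DMM parameters are not modeled. All names (`render`, `invert_latent`, `finetune_weights`) are hypothetical.

```python
def render(weights, latent):
    """Toy linear 'generator': one output value per weight row."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def loss(weights, latent, target):
    """Sum of squared differences between the rendering and the target."""
    return sum((p - t) ** 2 for p, t in zip(render(weights, latent), target))


def invert_latent(weights, latent, target, lr=0.05, steps=200):
    """Stage 1 (inversion): gradient descent on the latent, weights frozen."""
    latent = list(latent)
    for _ in range(steps):
        residual = [p - t for p, t in zip(render(weights, latent), target)]
        # d loss / d z_j = 2 * sum_i residual_i * W_ij
        grad = [2.0 * sum(r * row[j] for r, row in zip(residual, weights))
                for j in range(len(latent))]
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent


def finetune_weights(weights, latent, target, lr=0.05, steps=200):
    """Stage 2 (tuning): gradient descent on the weights, latent kept fixed."""
    weights = [list(row) for row in weights]
    for _ in range(steps):
        residual = [p - t for p, t in zip(render(weights, latent), target)]
        # d loss / d W_ij = 2 * residual_i * z_j
        weights = [[w - lr * 2.0 * r * z for w, z in zip(row, latent)]
                   for r, row in zip(residual, weights)]
    return weights


# Hypothetical "prior" weights and a target the prior cannot reproduce exactly.
prior_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mean_latent = [0.0, 0.0]           # stand-in for the average training latent
target = [1.0, 2.0, 2.0]           # stand-in for the few-shot observation

initial_loss = loss(prior_weights, mean_latent, target)
pivot = invert_latent(prior_weights, mean_latent, target)
inverted_loss = loss(prior_weights, pivot, target)
tuned_weights = finetune_weights(prior_weights, pivot, target)
tuned_loss = loss(tuned_weights, pivot, target)
```

In this toy setting, inversion alone leaves a residual error because the target lies outside what the frozen generator can express; fine-tuning the weights around the recovered latent removes it. That is the intuition for why adapting the prior's weights, rather than only searching its latent space, helps preserve identity.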