Zhixuan Yu, Ziqian Bai, Abhimitra Meka, Feitong Tan, Qiangeng Xu, Rohit Pandey, Sean Fanello, Hyun Soo Park, Yinda Zhang
Abstract: Traditional methods for constructing high-quality, personalized head avatars
from monocular videos demand extensive face captures and training time, posing
a significant challenge for scalability. This paper introduces a novel approach
to create high quality head avatar utilizing only a single or a few images per
user. We learn a generative model for 3D animatable photo-realistic head avatar
from a multi-view dataset of expressions from 2407 subjects, and leverage it as
a prior for creating personalized avatar from few-shot images. Different from
previous 3D-aware face generative models, our prior is built with a
3DMM-anchored neural radiance field backbone, which we show to be more
effective for avatar creation through auto-decoding based on few-shot inputs.
We also handle unstable 3DMM fitting by jointly optimizing the 3DMM fitting and
camera calibration that leads to better few-shot adaptation. Our method
demonstrates compelling results and outperforms existing state-of-the-art
methods for few-shot avatar adaptation, paving the way for more efficient and
personalized avatar creation.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
The paper addresses the problem of generating high-quality, animatable, photorealistic head avatars from a single image or a few images. Traditional methods need extensive facial capture data and long training times to build a high-quality personalized avatar, which limits their scalability. The paper proposes a method that generates high-quality 3D avatars from only one or a few images of the target person. Specifically, it addresses the shortcomings of existing methods in four ways:
1. **Reduced data requirements**: A generative model learned from a large-scale multi-view, multi-expression dataset serves as a prior, which reduces the amount of user-specific data needed.
2. **High-quality generation**: High-fidelity 3D avatars are generated by combining a 3D Morphable Model (3DMM) with a Neural Radiance Field (NeRF).
3. **Improved stability**: Jointly optimizing the 3DMM fitting, the camera calibration and the model weights stabilizes few-shot adaptation and reduces animation artifacts and identity shift.
4. **Enhanced generalization ability**: Learning from multi-view, multi-expression data improves the model's generalization to extreme views and unseen expressions.
### Method overview
1. **Multi-view and multi-expression face capture**:
- Capture high-resolution facial images of 2407 subjects under 13 predefined expressions, with each subject photographed from 13 sparse camera views for each expression.
- Reconstruct the 3D geometry with a 3DMM fitting algorithm based on facial landmarks.
2. **Generative avatar prior**:
- Represent the avatar as a 3DMM-anchored Neural Radiance Field (NeRF), with local features attached to the vertices of the 3DMM mesh.
- The identity branch uses StyleGAN2 to generate identity feature maps, and the expression branch uses a U-Net to generate expression feature maps.
- Sample both feature maps at the 3DMM vertices through their texture coordinates to build the 3DMM-anchored Neural Radiance Field.
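The texture-coordinate sampling in step 2 can be illustrated with a minimal sketch. It assumes feature maps stored as nested lists indexed `[row][column][channel]`, UV coordinates in `[0, 1]`, bilinear interpolation, and concatenation of the identity and expression features per vertex. The function names `sample_bilinear` and `vertex_features` are hypothetical, and the summary does not say which interpolation or fusion scheme the paper uses.

```python
from typing import List, Sequence, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]


def sample_bilinear(fmap: FeatureMap, u: float, v: float) -> List[float]:
    """Bilinearly sample a feature map at UV coordinates in [0, 1].

    u runs along columns and v along rows; values outside [0, 1] are clamped.
    """
    height, width = len(fmap), len(fmap[0])
    x = min(max(u, 0.0), 1.0) * (width - 1)
    y = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    sampled = []
    for c in range(len(fmap[0][0])):
        top = (1 - fx) * fmap[y0][x0][c] + fx * fmap[y0][x1][c]
        bottom = (1 - fx) * fmap[y1][x0][c] + fx * fmap[y1][x1][c]
        sampled.append((1 - fy) * top + fy * bottom)
    return sampled


def vertex_features(
    identity_map: FeatureMap,
    expression_map: FeatureMap,
    uvs: Sequence[Tuple[float, float]],
) -> List[List[float]]:
    """Attach one feature vector to each mesh vertex.

    Assumption: identity and expression features are simply concatenated.
    """
    return [
        sample_bilinear(identity_map, u, v) + sample_bilinear(expression_map, u, v)
        for u, v in uvs
    ]
```

In a real pipeline this lookup would normally run on the GPU through a differentiable sampling operator so that gradients reach both branches.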
3. **Few-shot adaptation**:
- Initialize the target latent code as the average latent code of the training subjects and jointly optimize it with the model weights.
- Adopt the PTI (Pivotal Tuning Inversion) strategy, alternating between inversion (optimizing the latent code) and fine-tuning the model weights.
- Improve the stability and performance of few-shot adaptation by jointly optimizing the camera pose, the 3DMM expression parameters and the model weights.
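The alternation between inversion and fine-tuning in step 3 can be sketched on a toy problem. The example below replaces the avatar model with a hypothetical linear "generator" (`render`) and uses plain gradient descent on a squared error; the names `render`, `mse` and `adapt`, the learning rates and the step counts are all illustrative assumptions. It leaves out the camera pose and 3DMM expression parameters that the paper also optimizes.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def render(weights: Matrix, latent: List[float]) -> List[float]:
    """Toy linear stand-in for the generator: one output per weight row."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]


def mse(pred: List[float], target: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def adapt(
    weights: Matrix,
    mean_latent: List[float],
    target: List[float],
    rounds: int = 3,
    steps: int = 200,
    lr_latent: float = 0.05,
    lr_weights: float = 0.05,
) -> Tuple[Matrix, List[float]]:
    """Alternate latent inversion and weight fine-tuning (PTI-style)."""
    weights = [row[:] for row in weights]  # copy so the prior stays untouched
    latent = list(mean_latent)             # start from the average latent code
    n = len(target)
    for _ in range(rounds):
        # Inversion: update the latent code while the weights are frozen.
        for _ in range(steps):
            residual = [p - t for p, t in zip(render(weights, latent), target)]
            grad = [
                2.0 / n * sum(residual[i] * weights[i][j] for i in range(n))
                for j in range(len(latent))
            ]
            latent = [z - lr_latent * g for z, g in zip(latent, grad)]
        # Fine-tuning: update the weights while the latent code is frozen.
        for _ in range(steps):
            residual = [p - t for p, t in zip(render(weights, latent), target)]
            for i in range(n):
                for j in range(len(latent)):
                    weights[i][j] -= lr_weights * 2.0 / n * residual[i] * latent[j]
    return weights, latent
```

The toy target can be chosen so that no latent code reproduces it exactly under the prior weights, which makes the role of the fine-tuning stage visible: inversion finds the closest latent code, and fine-tuning closes the remaining gap.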
4. **Training scheme**:
- Train on the multi-view dataset with a loss function that includes the photometric loss between the rendered image and the real image.
- Adopt an auto-decoder training strategy, with each identity having a 512-dimensional latent code.
- Randomly sample pixels for training, with a batch size of 131072, and optimize using the Adam optimizer.
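The ingredients of the training scheme in step 4 can be sketched as follows. The summary gives the latent size (512) and the batch size (131072 pixels) but not the exact form of the photometric loss, so the mean absolute difference used here is an assumption. The helper names `init_latent_codes`, `sample_pixel_coords` and `photometric_loss`, and the small Gaussian initialization, are likewise hypothetical. The Adam update itself is omitted.

```python
import random
from typing import List, Sequence, Tuple

LATENT_DIM = 512        # per-identity latent size stated in the summary
BATCH_PIXELS = 131072   # pixels per training batch stated in the summary

Image = List[List[Sequence[float]]]  # indexed as [row][column] -> channel values


def init_latent_codes(num_identities: int, dim: int = LATENT_DIM,
                      seed: int = 0) -> List[List[float]]:
    """Auto-decoder setup: one learnable latent code per training identity."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.01) for _ in range(dim)]
            for _ in range(num_identities)]


def sample_pixel_coords(height: int, width: int, count: int,
                        rng: random.Random) -> List[Tuple[int, int]]:
    """Pick `count` (row, column) pixels uniformly at random, with replacement."""
    return [(rng.randrange(height), rng.randrange(width)) for _ in range(count)]


def photometric_loss(rendered: Image, target: Image,
                     coords: Sequence[Tuple[int, int]]) -> float:
    """Mean absolute difference over the sampled pixels and their channels.

    Assumption: an L1-style loss; the summary only says "photometric loss".
    """
    total, terms = 0.0, 0
    for row, col in coords:
        for a, b in zip(rendered[row][col], target[row][col]):
            total += abs(a - b)
            terms += 1
    return total / terms
```

Sampling pixels instead of rendering full frames keeps each step's cost fixed regardless of image resolution, which is the usual reason NeRF-style training works on pixel batches.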
### Experimental results
1. **Evaluation datasets, metrics and baselines**:
- Dataset: Monocular selfie videos of 6 subjects from the MonoAvatar dataset, with two sessions per subject for training and testing.
- Metrics: Use LPIPS, PSNR and SSIM for quantitative evaluation.
- Baselines: MonoAvatar, Next3D, Ours-TP, Ours-FFHQ and Ours-SV.
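Of the three metrics listed above, PSNR is the simplest to state exactly: it is `10 * log10(MAX^2 / MSE)`, where `MAX` is the largest possible pixel value. A minimal sketch follows, assuming images flattened to lists of floats in `[0, 1]`; the function name `psnr` and its signature are illustrative. LPIPS (which needs a pretrained network) and SSIM are not covered here.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer to the target."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For example, a uniform error of 0.1 on every pixel gives an MSE of 0.01 and therefore a PSNR of about 20 dB.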
2. **Few-shot adaptation comparison**:
- The proposed method performs significantly better than Next3D and MonoAvatar in few-shot settings (for example, with a single image).
- Even when using 100% of the data, the proposed method still performs best, which suggests that the generative prior provides a good initialization of the network weights.
- Qualitative results show that the avatars generated by the proposed method have more accurate expressions, more consistent identities and fewer artifacts.
### Conclusion
By learning a generative prior from multi-view, multi-expression data, the proposed method generates high-quality, animatable avatars from a single image or a few images. Experimental results show that it outperforms the compared methods in few-shot adaptation while generalizing well and remaining stable.