OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis

Hongyi Xu, Guoxian Song, Zihang Jiang, Jianfeng Zhang, Yichun Shi, Jing Liu, Wanchun Ma, Jiashi Feng, Linjie Luo
DOI: https://doi.org/10.48550/arXiv.2303.15539
2023-03-28
Abstract: We present OmniAvatar, a novel geometry-guided 3D head synthesis model trained from in-the-wild unstructured images that is capable of synthesizing diverse identity-preserved 3D heads with compelling dynamic details under full disentangled control over camera poses, facial expressions, head shapes, articulated neck and jaw poses. To achieve such a high level of disentangled control, we first explicitly define a novel semantic signed distance function (SDF) around a head geometry (FLAME) conditioned on the control parameters. This semantic SDF allows us to build a differentiable volumetric correspondence map from the observation space to a disentangled canonical space from all the control parameters. We then leverage the 3D-aware GAN framework (EG3D) to synthesize detailed shape and appearance of 3D full heads in the canonical space, followed by a volume rendering step guided by the volumetric correspondence map to output into the observation space. To ensure the control accuracy on the synthesized head shapes and expressions, we introduce a geometry prior loss to conform to head SDF and a control loss to conform to the expression code. Further, we enhance the temporal realism with dynamic details conditioned upon varying expressions and joint poses. Our model can synthesize more preferable identity-preserved 3D heads with compelling dynamic details compared to the state-of-the-art methods both qualitatively and quantitatively. We also provide an ablation study to justify many of our system design choices.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis" aims to solve the following problems:

1. **High-precision 3D head synthesis**: Existing 3D head synthesis methods offer limited control over camera view, facial expression, head shape, and neck and jaw pose, and cannot control these factors independently of one another. This paper proposes a new geometry-guided 3D head synthesis model that achieves fine-grained, decoupled control over all of them.
2. **Generation of dynamic details**: Existing methods perform poorly at generating dynamic details (such as wrinkles and changes in light and shadow), especially as expressions and poses change. This paper introduces noise-conditional expressions to enhance the generation of dynamic details, making the synthesized 3D head more realistic.
3. **High-quality image synthesis**: Existing 3D head synthesis methods still have room for improvement in image quality. This paper combines 3D-aware generative adversarial networks (3D GANs) with neural radiance field (NeRF) techniques to achieve high-quality, multi-view-consistent image synthesis.
4. **3D reconstruction from single-view images**: Existing methods usually require multi-view data for 3D reconstruction, while the method in this paper achieves high-quality 3D head reconstruction from a single-view image and supports multi-view-consistent head reenactment.

### Main contributions

- **Novel geometry-guided 3D GAN framework**: Achieves comprehensive control of camera view, facial expression, head shape, and neck and jaw pose.
- **Semantic signed-distance function (SDF)**: Defines a volumetric correspondence map from the observation space to the canonical space, allowing the control parameters to be fully decoupled during 3D GAN training.
- **Geometry prior loss and control loss**: Ensure the accuracy of the synthesized 3D head shape and expressions.
- **Noise-conditional expressions**: Introducing noise-conditional expressions enhances the generation of dynamic details and improves temporal consistency.

### Method overview

1. **Semantic signed-distance function (SDF)**:
   - Defines a new semantic signed-distance function \(W(x \mid p = (\alpha, \beta, \theta)) = (s, \bar{x})\), where \(\alpha\) and \(\beta\) are the linear shape and expression blend-shape coefficients respectively, and \(\theta\) controls the 3-degree-of-freedom jaw and neck joint rotations.
   - Given a point \(x\) in the observation space, the function \(W\) returns its corresponding point \(\bar{x}\) in the canonical space and its signed distance \(s(x \mid p)\) to the nearest point on the FLAME mesh surface.
2. **Canonical generation and geometry prior**:
   - Uses the pre-trained semantic SDF model \(W(x \mid p)\) to model shape and expression changes, and uses a tri-plane representation to generate 3D-aware human heads.
   - Introduces a geometry prior loss \(L_{\text{prior}}\) that guides the generated neural radiance density field to conform to the FLAME head geometry.
3. **Fine-grained expression control**:
   - Uses an image-level supervision loss \(L_{\text{enc}}\) to improve the precision of expression control, ensuring that the expression in the synthesized image is consistent with the input control parameters.
4. **Dynamic detail modeling**:
   - Introduces noise-conditional expressions \(\beta\) and \(\theta\) in the MLP decoder to enhance the generation of dynamic details, so that the synthesized 3D head shows more realistic details as expressions and poses change.

### Experimental results

- **Quantitative comparison**: The method in this paper outperforms existing 2D and 3D controllable image synthesis methods in both image quality and control decoupling.
- **Ablation study**: Verifies the role of the geometry prior loss \(L_{\text{prior}}\) and the self-supervised reconstruction loss \(L_{\text{enc}}\) in improving the precision of shape and expression control.

### Conclusion

This paper proposes OmniAvata
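
As an illustration of the semantic SDF \(W(x \mid p) = (s, \bar{x})\) and the geometry prior loss described in the method overview, the following is a toy sketch only, not the paper's implementation. It assumes a unit sphere as a stand-in for the FLAME mesh, replaces the learned \(W\) with a nearest-surface-sample lookup, and assumes a sigmoid conversion from signed distance to density with a hypothetical sharpness parameter `kappa`. The names `toy_semantic_sdf`, `sdf_to_density`, and `geometry_prior_loss` are illustrative and do not come from the paper.

```python
import math
import random


def toy_semantic_sdf(x, obs_verts, obs_normals, canon_verts):
    """Toy stand-in for W(x | p) = (s, x_bar).

    obs_verts / obs_normals: surface samples and outward normals of the posed
    (observation-space) surface. canon_verts[i] is the canonical-space
    position of obs_verts[i].
    """
    # Nearest posed surface sample to the query point.
    i = min(range(len(obs_verts)), key=lambda k: math.dist(x, obs_verts[k]))
    offset = [a - b for a, b in zip(x, obs_verts[i])]
    # Signed distance along the outward normal: positive outside, negative inside.
    s = sum(o * n for o, n in zip(offset, obs_normals[i]))
    # Assumption: the offset from the surface carries over unchanged to canonical space.
    x_bar = [c + o for c, o in zip(canon_verts[i], offset)]
    return s, x_bar


def sdf_to_density(s, kappa=0.05):
    """Assumed conversion sigmoid(-s / kappa) / kappa: high density inside the surface."""
    # 0.5 * (1 - tanh(z / 2)) equals sigmoid(-z) and avoids overflow in exp.
    return 0.5 * (1.0 - math.tanh(s / (2.0 * kappa))) / kappa


def geometry_prior_loss(sigma_pred, sdf_values, kappa=0.05):
    """Mean L1 gap between predicted densities and SDF-derived target densities."""
    gaps = [abs(p - sdf_to_density(s, kappa)) for p, s in zip(sigma_pred, sdf_values)]
    return sum(gaps) / len(gaps)


def unit(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


# Canonical surface: random samples on a unit sphere (stand-in for the head mesh).
random.seed(0)
canon = [unit([random.gauss(0.0, 1.0) for _ in range(3)]) for _ in range(200)]
normals = canon  # On a unit sphere centered at the origin, normal == position.

# "Posed" surface: the canonical surface under a rigid translation.
shift = [0.2, 0.0, -0.1]
obs = [[c + t for c, t in zip(v, shift)] for v in canon]

# One query outside the posed surface and one inside it.
query_out = [o + c for o, c in zip(obs[0], canon[0])]
query_in = [o - 0.5 * c for o, c in zip(obs[1], canon[1])]

s_out, x_bar_out = toy_semantic_sdf(query_out, obs, normals, canon)
s_in, x_bar_in = toy_semantic_sdf(query_in, obs, normals, canon)

print("outside:", s_out, x_bar_out)
print("inside: ", s_in, x_bar_in)
print("prior loss for a matching density field:",
      geometry_prior_loss([sdf_to_density(s_out), sdf_to_density(s_in)], [s_out, s_in]))
```

In this toy setting the canonical point is simply the query with the translation undone, and the prior loss is zero when the predicted densities match the SDF-derived targets. In the paper, \(W\) is a pre-trained model conditioned on \((\alpha, \beta, \theta)\), and the density field comes from the tri-plane generator.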