DEGAS: Detailed Expressions on Full-Body Gaussian Avatars

Zhijing Shao, Duotun Wang, Qing-Yao Tian, Yao-Dong Yang, Hengyu Meng, Zeyu Cai, Bo Dong, Yu Zhang, Kang Zhang, Zeyu Wang
2024-08-20
Abstract: Although neural rendering has made significant advancements in creating lifelike, animatable full-body and head avatars, incorporating detailed expressions into full-body avatars remains largely unexplored. We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. Trained on multiview videos of a given subject, our method learns a conditional variational autoencoder that takes both the body motion and facial expression as driving signals to generate Gaussian maps in the UV layout. To drive the facial expressions, instead of the commonly used 3D Morphable Models (3DMMs) in 3D head avatars, we propose to adopt the expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars. Leveraging the rendering capability of 3DGS and the rich expressiveness of the expression latent space, the learned avatars can be reenacted to reproduce photorealistic rendering images with subtle and accurate facial expressions. Experiments on an existing dataset and our newly proposed dataset of full-body talking avatars demonstrate the efficacy of our method. We also propose an audio-driven extension of our method with the help of 2D talking faces, opening new possibilities to interactive AI agents.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper addresses the problem of adding rich facial expressions to full-body 3D avatars. Existing neural rendering techniques can create realistic, animatable full-body or head avatars, but integrating detailed facial expressions into full-body avatars remains under-explored. The paper proposes DEGAS (Detailed Expressions on Full-Body Gaussian Avatars), the first modeling method based on 3D Gaussian Splatting (3DGS) for full-body avatars with rich facial expressions. Trained on multiview videos of a given subject, the method learns a conditional variational autoencoder that takes both body motion and facial expression as driving signals and generates Gaussian maps in the UV layout. To drive facial expressions, the paper adopts an expression latent space trained solely on 2D portrait images, rather than the 3D Morphable Models (3DMMs) commonly used for 3D head avatars, thereby bridging the gap between 2D talking faces and 3D avatars. The learned avatars can then be reenacted to produce photorealistic renderings with subtle and accurate facial expressions. Experiments on an existing dataset and on the newly proposed dataset of full-body talking avatars demonstrate the method's efficacy. The paper also proposes an audio-driven extension that, with the help of 2D talking faces, opens new possibilities for interactive AI agents.
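To make the data flow described above more concrete, here is a minimal NumPy sketch of a conditional decoder that maps a sampled latent code, a body-pose vector, and an expression latent to a Gaussian map in UV layout. It is an illustration under stated assumptions, not the authors' architecture: every dimension, the 14-channel layout, the single linear layer standing in for the network, and the names `ToyConditionalDecoder` and `split_gaussian_map` are hypothetical.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
POSE_DIM, EXPR_DIM, LATENT_DIM = 72, 64, 16
UV_RES = 32
CHANNELS = 14  # assumed layout: 3 offset + 4 rotation + 3 scale + 1 opacity + 3 color


def reparameterize(mu, logvar, rng):
    """Standard VAE reparameterization: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ToyConditionalDecoder:
    """One linear layer standing in for the decoder of a conditional VAE."""

    def __init__(self, rng):
        in_dim = LATENT_DIM + POSE_DIM + EXPR_DIM
        out_dim = UV_RES * UV_RES * CHANNELS
        self.weight = rng.standard_normal((in_dim, out_dim)) * 0.01
        self.bias = np.zeros(out_dim)

    def __call__(self, z, pose, expr):
        # Condition the decoder on both driving signals by concatenation.
        cond = np.concatenate([z, pose, expr])
        flat = cond @ self.weight + self.bias
        return flat.reshape(UV_RES, UV_RES, CHANNELS)


def split_gaussian_map(gmap, uv_mask):
    """Turn each valid UV texel into one Gaussian's parameters."""
    texels = gmap[uv_mask]  # shape (N, CHANNELS)
    # Bias the quaternion toward identity, then normalize to unit length.
    rot = texels[:, 3:7] + np.array([1.0, 0.0, 0.0, 0.0])
    rot = rot / np.linalg.norm(rot, axis=1, keepdims=True)
    return {
        "offset": texels[:, 0:3],
        "rotation": rot,
        "scale": np.exp(texels[:, 7:10]),       # keep scales positive
        "opacity": sigmoid(texels[:, 10:11]),   # keep opacity in (0, 1)
        "color": sigmoid(texels[:, 11:14]),     # keep color in (0, 1)
    }


rng = np.random.default_rng(0)
decoder = ToyConditionalDecoder(rng)
pose = rng.standard_normal(POSE_DIM)   # stand-in for a body-motion signal
expr = rng.standard_normal(EXPR_DIM)   # stand-in for a 2D-trained expression latent
z = reparameterize(np.zeros(LATENT_DIM), np.zeros(LATENT_DIM), rng)
gmap = decoder(z, pose, expr)
uv_mask = np.ones((UV_RES, UV_RES), dtype=bool)  # assume every texel is valid
gaussians = split_gaussian_map(gmap, uv_mask)
print(gmap.shape, gaussians["rotation"].shape)
```

In this sketch the expression latent is just another conditioning vector, which is the property that lets a signal learned from 2D portrait images drive a 3D avatar; the resulting per-texel parameters would then be passed to a 3DGS rasterizer, which is omitted here.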