Universal Facial Encoding of Codec Avatars from VR Headsets

Shaojie Bai, Te-Li Wang, Chenghui Li, Akshay Venkatesh, Tomas Simon, Chen Cao, Gabriel Schwartz, Ryan Wrench, Jason Saragih, Yaser Sheikh, Shih-En Wei
DOI: https://doi.org/10.1145/3658234
2024-07-18
Abstract: Faithful real-time facial animation is essential for avatar-mediated telepresence in Virtual Reality (VR). To emulate authentic communication, avatar animation needs to be efficient and accurate: able to capture both extreme and subtle expressions within a few milliseconds to sustain the rhythm of natural conversations. The oblique and incomplete views of the face, variability in the donning of headsets, and illumination variation due to the environment are some of the unique challenges in generalization to unseen faces. In this paper, we present a method that can animate a photorealistic avatar in realtime from head-mounted cameras (HMCs) on a consumer VR headset. We present a self-supervised learning approach, based on a cross-view reconstruction objective, that enables generalization to unseen users. We present a lightweight expression calibration mechanism that increases accuracy with minimal additional cost to run-time efficiency. We present an improved parameterization for precise ground-truth generation that provides robustness to environmental variation. The resulting system produces accurate facial animation for unseen users wearing VR headsets in realtime. We compare our approach to prior face-encoding methods demonstrating significant improvements in both quantitative metrics and qualitative results.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses efficient and accurate facial animation in virtual reality (VR), particularly on consumer-grade VR headsets. Specifically, the authors aim to develop a universal facial encoding method that drives a photorealistic avatar in real time from images captured by the cameras on the headset. The challenge has three main parts:

1. **Balance between high fidelity and low latency**: To emulate real-life communication, avatar animation must be both efficient and accurate, capturing extreme and subtle expression changes within a few milliseconds so that the rhythm of natural conversation is preserved. The system must therefore keep latency very low while still producing high-quality output.
2. **Incomplete facial data and limited viewing angles**: Because the headset occludes the face, some regions (especially the upper part) are difficult to capture fully. Differences in how the device is worn and changes in ambient lighting make the captured data even harder to interpret.
3. **Generalization to unseen users**: Traditional facial encoding methods usually require extensive data collection and model training for each user, which is both time-consuming and costly. The proposed method aims to produce accurate facial animation for new, unseen users through techniques such as self-supervised learning, without additional personalized training.

To address these problems, the authors propose a self-supervised learning framework that trains on a large amount of unpaired headset-captured data and uses a cross-view reconstruction objective to learn identity-independent facial expression features.

In addition, they design a lightweight calibration mechanism. Before a call, the user performs several predefined anchor expressions (such as opening the mouth as wide as possible or widening the eyes), which improves encoding accuracy while adding almost no runtime computational cost. Together, these contributions allow the system to generate facial animation efficiently and accurately across different users and environments.
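
The cross-view reconstruction idea mentioned above can be illustrated with a toy sketch. This is not the authors' implementation: the linear encoder and decoder, the matrices `mix_a` and `mix_b`, and all function names are hypothetical, chosen only to show the shape of the objective. An expression code is computed from one camera view and scored by how well it predicts a different view of the same moment.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cross_view_loss(view_a, view_b, enc, dec):
    """Mean squared error between view B and its prediction from view A.

    enc maps view A to an expression code; dec maps that code to view B.
    Both are plain linear maps here, which is an assumption of this toy.
    """
    code = matvec(enc, view_a)
    pred_b = matvec(dec, code)
    return sum((p - t) ** 2 for p, t in zip(pred_b, view_b)) / len(view_b)


# Hypothetical toy data: two 3-pixel "camera views" that are different
# linear functions of the same 2-D facial expression.
expression = [0.5, -1.0]
mix_a = [[1, 0], [0, 1], [1, 1]]
mix_b = [[2, 0], [0, -1], [1, -1]]
view_a = matvec(mix_a, expression)
view_b = matvec(mix_b, expression)

# An encoder that reads the expression from the first two pixels of view A,
# paired with the true mixing of view B, reconstructs view B exactly.
enc_good = [[1, 0, 0], [0, 1, 0]]
dec_good = mix_b
loss_good = cross_view_loss(view_a, view_b, enc_good, dec_good)

# An encoder that collapses both code dimensions onto a single pixel loses
# expression information, so the other view can no longer be predicted.
enc_bad = [[0, 0, 1], [0, 0, 1]]
loss_bad = cross_view_loss(view_a, view_b, enc_bad, dec_good)

print(loss_good, loss_bad)
```

In this toy setting the loss is low only when the code retains the information shared by both views, which is the intuition behind using such an objective to learn expression features without per-user labels. A real system would presumably use learned neural networks and images rather than fixed linear maps.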