Abstract: Faithful real-time facial animation is essential for avatar-mediated telepresence in Virtual Reality (VR). To emulate authentic communication, avatar animation needs to be efficient and accurate: able to capture both extreme and subtle expressions within a few milliseconds to sustain the rhythm of natural conversations. The oblique and incomplete views of the face, variability in the donning of headsets, and illumination variation due to the environment are some of the unique challenges in generalization to unseen faces. In this paper, we present a method that can animate a photorealistic avatar in realtime from head-mounted cameras (HMCs) on a consumer VR headset. We present a self-supervised learning approach, based on a cross-view reconstruction objective, that enables generalization to unseen users. We present a lightweight expression calibration mechanism that increases accuracy with minimal additional cost to run-time efficiency. We present an improved parameterization for precise ground-truth generation that provides robustness to environmental variation. The resulting system produces accurate facial animation for unseen users wearing VR headsets in realtime. We compare our approach to prior face-encoding methods, demonstrating significant improvements in both quantitative metrics and qualitative results.

What problem does this paper attempt to address?

This paper addresses efficient and accurate facial animation in virtual reality (VR), particularly on consumer-grade VR headsets. Specifically, the authors aim to develop a general-purpose facial encoding method that animates a realistic digital avatar in real time from facial data captured by the cameras on a VR headset. The challenge has three main aspects:

1. **Balance between high fidelity and low latency**: To emulate real-life communication, avatar animation must be both efficient and accurate, capturing extreme and subtle expression changes within a few milliseconds to sustain the rhythm of natural conversation. The system must therefore keep latency very low while still producing high-quality output.
2. **Incomplete facial data and limited viewing angles**: Because the headset occludes the face, some regions (especially the upper part) are difficult to capture fully. Differences in how the device is worn and changes in ambient lighting make usable facial data harder still to obtain.
3. **Generalization to unseen users**: Traditional face-encoding methods usually require collecting a large amount of training data and training a model for each user, which is both time-consuming and costly. The proposed method aims to generate accurate facial animation for new, unseen users through techniques such as self-supervised learning, without additional personalized training.

To address these problems, the authors propose a framework based on self-supervised learning. It trains on a large amount of unpaired headset-captured data and uses a cross-view reconstruction objective to drive the learning of identity-independent facial expression features. They also design a lightweight calibration mechanism: before a call, the user performs several predefined anchor expressions (such as opening the mouth as wide as possible or opening the eyes wide), which improves encoding accuracy while adding almost no runtime computational cost. Together, these contributions let the system generate facial animation efficiently and accurately across different users and environments.

Universal Facial Encoding of Codec Avatars from VR Headsets

Fast Registration of Photorealistic Avatars for VR Facial Animation

VR Facial Animation for Immersive Telepresence Avatars

High-fidelity facial and speech animation for VR HMDs

AvatarWild: Fully Controllable Head Avatars in the Wild

Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos

Real-time Expressive Avatar Animation Generation Based on Monocular Videos

Real-time Facial Animation with Image-Based Dynamic Avatars

Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-time Mobile Telepresence

Attention-Based VR Facial Animation with Visual Mouth Camera Guidance for Immersive Telepresence Avatars

GPAvatar: Generalizable and Precise Head Avatar from Image(s)

Video-driven state-aware facial animation

Expressive Telepresence via Modular Codec Avatars

FreeAvatar: Robust 3D Facial Animation Transfer by Learning an Expression Foundation Model

Facial Expression Retargeting from Human to Avatar Made Easy

Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes

FATE: Full-head Gaussian Avatar with Textural Editing from Monocular Video

Facial performance sensing head-mounted display

URAvatar: Universal Relightable Gaussian Codec Avatars

AvatarReX: Real-time Expressive Full-body Avatars