Fast Registration of Photorealistic Avatars for VR Facial Animation

Chaitanya Patel, Shaojie Bai, Te-Li Wang, Jason Saragih, Shih-En Wei
2024-07-19
Abstract: Virtual Reality (VR) bears the promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, and hence the acquisition of labels for headset-mounted camera (HMC) images, captured while the user wears a VR headset, needs to be efficient and accurate. This is challenging due to oblique camera views and differences in image modality. In this work, we first show that the domain gap between the avatar and HMC images is one of the primary sources of difficulty: a transformer-based architecture achieves high accuracy on domain-consistent data, but degrades when the domain gap is reintroduced. Building on this finding, we propose a system split into two parts: an iterative refinement module that takes in-domain inputs, and a generic avatar-guided image-to-image domain transfer module conditioned on current estimates. These two modules reinforce each other: domain transfer becomes easier when close-to-groundtruth examples are shown, and better domain-gap removal in turn improves the registration. Our system obviates the need for costly offline optimization, and produces online registration of higher quality than direct regression methods. We validate the accuracy and efficiency of our approach through extensive experiments on a commodity headset, demonstrating significant improvements over these baselines. To stimulate further research in this direction, we make our large-scale dataset and code publicly available.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of driving photorealistic facial animation in Virtual Reality (VR) environments, specifically:

1. **Problem Background**: To make VR social interactions more immersive, the user's personalized photorealistic avatar must be animated accurately. This involves capturing images from the cameras mounted on the VR headset and precisely registering these images to the user's avatar.
2. **Core Challenge**: A VR headset occludes the wearer's face, so facial expressions must be captured by the headset's built-in cameras (headset-mounted cameras, HMC) and used to drive the avatar. The oblique camera views and the differences in image modality (such as the disparity between infrared and visible light) make it difficult to use these images directly for facial expression registration.
3. **Research Objective**: To propose an efficient and accurate method for registering facial expressions and head poses under oblique camera views and for unseen identities (i.e., different users). The method needs to overcome the domain gap between HMC images and avatar renderings and generate high-quality image-label pairs within a limited time frame.
4. **Method Overview**:
   - The paper first demonstrates that a Transformer-based network can estimate expression and head pose with high accuracy when the camera and avatar domains are matched, and that accuracy degrades when the domain gap is reintroduced.
   - Based on this finding, the authors propose a system with two parts: an iterative refinement module that takes in-domain inputs, and a generic, avatar-guided image-to-image domain transfer module conditioned on the current estimates. The two modules reinforce each other: estimates closer to the ground truth make domain transfer easier, and better domain-gap removal in turn improves registration.
5. **Contributions Summary**:
   - Demonstrated that accurate and efficient generic facial registration can be achieved in matched camera-avatar domains without relying on 3D geometric information.
   - Proposed a generic style transfer network that precisely preserves the facial expressions of unseen identities.
   - Overall, provided a method for establishing high-fidelity image-label pairs for personalized avatars under time constraints and oblique camera views.

Through these methods, the paper aims to improve the realism and immersion of facial animation in virtual reality environments, thereby enhancing the user experience.
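The alternating scheme in the method overview (domain transfer conditioned on the current estimate, followed by refinement on in-domain inputs) can be illustrated with a small control-flow sketch. This is not the authors' implementation: `register`, `toy_render`, `toy_style_transfer`, `toy_refine`, the affine `GAIN`/`BIAS` domain gap, and the `leak` parameter are all hypothetical stand-ins, and plain lists of floats stand in for images and avatar parameters. The toy transfer is deliberately biased toward its conditioning render, so its output improves as the estimate improves.

```python
from typing import Callable, List

Vector = List[float]

# Hypothetical domain gap for the toy: hmc = GAIN * avatar_image + BIAS.
GAIN, BIAS = 0.5, 0.2


def register(
    hmc_image: Vector,
    init_params: Vector,
    render: Callable[[Vector], Vector],
    style_transfer: Callable[[Vector, Vector], Vector],
    refine: Callable[[Vector, Vector, Vector], Vector],
    num_iters: int = 3,
) -> Vector:
    """Alternate domain transfer and refinement for a fixed number of passes."""
    params = list(init_params)
    for _ in range(num_iters):
        # Avatar-domain render of the current estimate.
        rendered = render(params)
        # Map the HMC image into the avatar domain, conditioned on that render.
        in_domain = style_transfer(hmc_image, rendered)
        # Update the estimate by comparing two images in the same domain.
        params = refine(params, in_domain, rendered)
    return params


def toy_render(params: Vector) -> Vector:
    """Toy renderer: the 'image' is just a copy of the parameters."""
    return list(params)


def toy_style_transfer(hmc_image: Vector, rendered: Vector, leak: float = 0.5) -> Vector:
    """Toy transfer: undo the affine gap, then blend toward the conditioning
    render to mimic a module whose output is biased by its conditioning."""
    ideal = [(v - BIAS) / GAIN for v in hmc_image]
    return [(1.0 - leak) * t + leak * r for t, r in zip(ideal, rendered)]


def toy_refine(params: Vector, in_domain: Vector, rendered: Vector, step: float = 1.0) -> Vector:
    """Toy refinement: move the parameters along the in-domain residual."""
    return [p + step * (t - r) for p, t, r in zip(params, in_domain, rendered)]


# Usage: synthesize an HMC observation from known parameters, then recover them.
TRUE_PARAMS = [0.8, -0.4, 0.1]
hmc = [GAIN * v + BIAS for v in toy_render(TRUE_PARAMS)]
estimate = register(hmc, [0.0, 0.0, 0.0], toy_render, toy_style_transfer,
                    toy_refine, num_iters=10)
```

With these toy settings (`leak=0.5`, `step=1.0`), each pass halves the remaining parameter error, which mirrors the mutual reinforcement described above only in spirit; in the paper both modules are learned networks rather than closed-form functions.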