Abstract: While head-mounted displays (HMDs) for Virtual Reality (VR) have become widely available in the consumer market, they pose a considerable obstacle for a realistic face-to-face conversation in VR since HMDs hide a significant portion of the participants' faces. Even with image streams from cameras directly attached to an HMD, stitching together a convincing image of an entire face remains a challenging task because of extreme capture angles and strong lens distortions due to a wide field of view. Compared to the long line of research in VR, reconstruction of faces hidden beneath an HMD is a very recent topic of research. While the current state-of-the-art solutions demonstrate photo-realistic 3D reconstruction results, they require high-cost laboratory equipment and large computational costs. We present an approach that focuses on low-cost hardware and can be used on a commodity gaming computer with a single GPU. We leverage the benefits of an end-to-end pipeline by means of Generative Adversarial Networks (GANs). Our GAN produces a frontal-facing 2.5D point cloud based on a training dataset captured with an RGBD camera. In our approach, the training process is offline, while the reconstruction runs in real-time. Our results show adequate reconstruction quality within the 'learned' expressions. Expressions not learned by the network produce artifacts and can trigger the Uncanny Valley effect.
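
To make the term "2.5D point cloud" concrete, the following is a minimal sketch of how a single RGBD depth frame can be back-projected into 3D points with a pinhole camera model. It is not the authors' code: the function name `depth_to_point_cloud` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration, and the paper's GAN generates such a cloud rather than computing it this way at run time.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into an (N, 3) array of camera-space points.

    Assumptions (not from the paper): a pinhole camera with focal lengths
    fx, fy and principal point cx, cy in pixels; zero depth marks a missing
    measurement and is dropped.
    """
    h, w = depth.shape
    # Pixel coordinate grids: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    # Invert the pinhole projection u = fx * x / z + cx (likewise for v).
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Keep only pixels that carry a valid depth value.
    return points[points[:, 2] > 0]
```

Because every pixel yields at most one point, the result only covers surfaces visible from the camera, which is why such a cloud is described as 2.5D rather than full 3D.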

What problem does this paper attempt to address?

The paper attempts to address the issue of achieving realistic face-to-face communication in Virtual Reality (VR). Specifically, when users wear Head-Mounted Displays (HMDs), the HMDs obscure their faces, making natural facial communication in VR environments difficult. To solve this problem, the paper proposes a real-time facial visualization pipeline based on neural rendering, aiming to achieve real-time, high-quality facial reconstruction using low-cost hardware, thereby improving remote presence and live broadcast experiences in VR.

### Main Objectives of the Paper:

1. **Achieve low-cost high-quality facial reconstruction**: Existing high-precision facial reconstruction methods typically require expensive laboratory equipment and substantial computational resources. The proposed method can run on a standard gaming computer, achieving real-time high-quality facial reconstruction with just a single GPU.
2. **Enhance the realism of facial expressions**: By using Generative Adversarial Networks (GANs), the proposed method can generate realistic 2.5D point clouds, enabling natural facial expressions and eye contact in virtual environments.
3. **Enhance the immersion of social interactions**: Through real-time facial reconstruction, social interactions in VR will become more natural and realistic, reducing the "uncanny valley" effect and increasing user immersion and engagement.

### Key Technologies of the Solution:

- **Generative Adversarial Networks (GANs)**: The paper uses GANs to generate realistic 2.5D point clouds, trained on data captured from RGBD cameras.
- **Multi-scale Discriminator**: To improve image quality and detail, the paper introduces a multi-scale discriminator, combined with feature matching loss and LPIPS loss functions.
- **Low Hardware Cost**: The entire system can run on standard consumer-grade hardware, lowering the barrier to achieving high-quality facial reconstruction.
- **Real-time Performance**: By optimizing the network architecture and loss functions, the proposed method can achieve real-time facial reconstruction and rendering while maintaining high image quality.

### Experimental Results:

- **Quantitative Evaluation**: The paper uses Structural Similarity Index (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) as evaluation metrics. The results show that the new method significantly outperforms previous systems in terms of image quality and detail retention.
- **Qualitative Evaluation**: Through visual comparison, the facial images generated by the new method show significant improvements in detail and consistency, especially in the reconstruction of high-frequency details such as skin pores and facial hair.

In summary, the paper effectively addresses the social interaction barriers caused by HMDs obscuring faces in VR environments by proposing a low-cost, high-performance facial reconstruction method based on GANs, providing a new technical pathway for future VR applications.
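
As an illustration of the SSIM metric named in the quantitative evaluation, here is a minimal sketch that computes SSIM over a whole grayscale image as a single window. This is a simplification and an assumption on our part: the standard metric averages the same expression over local (typically Gaussian-weighted) windows, and the summary does not state which implementation the paper used. The function name `global_ssim` is hypothetical.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM between two equally sized grayscale images.

    Simplified sketch: treats the full image as one window instead of
    averaging over local windows as the standard metric does.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical images score 1, and the score drops as luminance, contrast, or structure diverge. LPIPS, by contrast, compares deep network features and needs a pretrained model, so it is not sketched here.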

Towards a Pipeline for Real-Time Visualization of Faces for VR-based Telepresence and Live Broadcasting Utilizing Neural Rendering
