Abstract: While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape, they commonly miss subtle, extreme, asymmetric, or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation, and a lack of expression diversity in the training images. For training, most methods employ differentiable rendering to compare a predicted face mesh with the input image, along with a plethora of additional loss functions. This differentiable rendering loss not only has to provide supervision to optimize for 3D face geometry, camera, albedo, and lighting, which is an ill-posed optimization problem, but the domain gap between rendering and input image further hinders the learning process. Instead, SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry and sparsely sampled pixels of the input image, generates a face image. As the neural rendering gets color information from sampled image pixels, supervising with a neural rendering-based reconstruction loss can focus solely on the geometry. Further, it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative, quantitative, and particularly our perceptual evaluations demonstrate that SMIRK achieves new state-of-the-art performance on accurate expression reconstruction. Project webpage:

What problem does this paper attempt to address?

This paper addresses a weakness of monocular 3D face reconstruction: existing methods recover the overall face shape well but miss subtle, extreme, asymmetric, or rare expressions, which matter for human perception. The paper traces this to two key limitations in existing methods:

1. **Shortcomings in the self-supervised training formulation**: Most methods use differentiable rendering to compare the predicted face mesh with the input image, combined with many additional loss functions. The differentiable rendering loss must supervise 3D face geometry, camera, albedo, and lighting at once, which is an ill-posed optimization problem, and the domain gap between the rendering and the input image further hinders learning.
2. **Lack of expression diversity in training images**: The training data contains too few diverse expressions, so models generalize poorly to complex or rare ones.

To address these limitations, the paper proposes SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), an analysis-by-neural-synthesis method that reconstructs expressive 3D faces from images more faithfully. SMIRK replaces differentiable rendering with a neural rendering module that generates a face image from the rendered predicted mesh geometry and sparsely sampled pixels of the input image. Because the color information comes from the sampled pixels, the reconstruction loss can focus on geometry alone. The same module also generates images of the input identity with varying expressions during training; these are fed back to the reconstruction model and supervised with the known geometry through an expression consistency loss, which augments the training data and improves generalization to diverse expressions.

With these changes, SMIRK shows superior performance in qualitative, quantitative, and perceptual evaluations, especially in accurately capturing subtle, extreme, and asymmetric expressions.
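As a rough illustration of the input the neural rendering module is described as receiving, the NumPy sketch below keeps a small random fraction of the input image's pixels and stacks them with a rendered geometry map. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the uniform random sampling, the `keep_fraction` value, and the channel layout are all hypothetical, and the neural renderer itself is not shown.

```python
import numpy as np


def sparse_pixel_sample(image, keep_fraction=0.01, rng=None):
    """Keep a random fraction of pixels and zero out the rest.

    image: float array of shape (H, W, 3).
    keep_fraction: hypothetical sampling rate (an assumption, not from the paper).
    Returns (masked_image, mask), where mask has shape (H, W, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, _ = image.shape
    # One Bernoulli draw per pixel, shared across the three color channels.
    mask = (rng.random((h, w, 1)) < keep_fraction).astype(image.dtype)
    return image * mask, mask


def build_renderer_input(image, rendered_geometry, keep_fraction=0.01, rng=None):
    """Stack the rendered geometry with the sparsely sampled image pixels.

    rendered_geometry: float array of shape (H, W, C), e.g. a shaded mesh render.
    Returns an array of shape (H, W, C + 3); the layout is an assumption.
    """
    masked_image, _ = sparse_pixel_sample(image, keep_fraction, rng)
    return np.concatenate([rendered_geometry, masked_image], axis=-1)
```

Because most pixels are removed, a generator trained on such inputs has to take the facial layout from the geometry render and can use the sampled pixels only as color hints, which is the intuition behind letting the reconstruction loss focus on geometry.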

3D Facial Expressions through Analysis-by-Neural-Synthesis

Dynamic 3D Facial Expression Reconstruction from Images

Neural Emotion Director: Speech-preserving semantic control of facial expressions in "in-the-wild" videos

EMOCA: Emotion Driven Monocular Face Capture and Animation

AnimateMe: 4D Facial Expressions via Diffusion Models

Real-time Facial Expression Recognition "In The Wild" by Disentangling 3D Expression from Identity

Near-realtime Facial Animation by Deep 3D Simulation Super-Resolution

High-Fidelity 3D Head Avatars Reconstruction through Spatially-Varying Expression Conditioned Neural Radiance Field

Fake It Without Making It: Conditioned Face Generation for Accurate 3D Face Reconstruction

Facial Expression Re-targeting from a Single Character

Rendering with style

ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling

Video-Driven Neural Physically-Based Facial Asset for Production

AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction

Neural Sign Reenactor: Deep Photorealistic Sign Language Retargeting

PIRenderer: Controllable Portrait Image Generation Via Semantic Neural Rendering

Neural Relighting and Expression Transfer On Video Portraits

Cafca: High-quality Novel View Synthesis of Expressive Faces from Casual Few-shot Captures

MA-NeRF: Motion-Assisted Neural Radiance Fields for Face Synthesis from Sparse Images

Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction

OFER: Occluded Face Expression Reconstruction