Abstract: Video-driven 3D facial animation transfer aims to drive avatars to reproduce the expressions of actors. Existing methods have achieved remarkable results by constraining both geometric and perceptual consistency. However, geometric constraints (like those designed on facial landmarks) are insufficient to capture subtle emotions, while expression features trained on classification tasks lack fine granularity for complex emotions. To address this, we propose **FreeAvatar**, a robust facial animation transfer method that relies solely on our learned expression representation. Specifically, FreeAvatar consists of two main components: the expression foundation model and the facial animation transfer model. In the first component, we initially construct a facial feature space through a face reconstruction task and then optimize the expression feature space by exploring the similarities among different expressions. Benefiting from training on large amounts of unlabeled facial images and a re-collected expression comparison dataset, our model adapts freely and effectively to any in-the-wild input facial images. In the facial animation transfer component, we propose a novel Expression-driven Multi-avatar Animator, which first maps expressive semantics to the facial control parameters of 3D avatars and then imposes perceptual constraints between the input and output images to maintain expression consistency. To make the entire process differentiable, we employ a trained neural renderer to translate rig parameters into corresponding images. Furthermore, unlike previous methods that require separate decoders for each avatar, we propose a dynamic identity injection module that allows for the joint training of multiple avatars within a single network.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses several key problems in video-driven 3D facial animation transfer:

1. **Insufficiency of geometric constraints**: Existing methods usually rely on facial geometric constraints (such as metrics based on facial landmarks), which struggle to capture subtle expression changes such as a slight frown or lip compression.
2. **Limitations of expression features**: Existing expression features are usually trained on emotion classification tasks with a limited number of categories, so they cannot capture the fine-grained differences among complex emotions.
3. **Natural and accurate expression transfer**: Maintaining facial emotional consistency while achieving natural and accurate expression transfer remains a challenge, especially on in-the-wild data.

To address these problems, the authors propose **FreeAvatar**, a robust facial animation transfer method that relies only on a learned expression representation. The core idea is to first learn a continuous, semantically distinguishable expression representation, and then design an animation transfer model that accurately decodes this representation into the expression of the target avatar.

### Main contributions

1. **3D facial animation transfer relying only on expression representation**: FreeAvatar is presented as the first 3D facial animation transfer method that relies entirely on expression representation, and it achieves high-fidelity 3D facial animation transfer on in-the-wild data.
2. **Expression foundation model**: An expression foundation model is introduced to construct a general, fine-grained, and continuous latent space that suits a wide variety of faces, including stylized virtual characters. This model helps maintain high expression consistency during facial animation transfer.
3. **Expression-driven Multi-avatar Animator**: An Expression-driven Multi-avatar Animator is designed to decode the expression representation into facial control parameters while maintaining expression consistency. The dynamic identity injection module and the identity-conditional loss enable the model to handle multiple avatars with a single decoder.

### Method overview

1. **Expression foundation model**:
   - **Construction of facial feature space**: A Masked Autoencoder (MAE) learns intrinsic facial features from a large number of unlabeled facial images, which improves the generalization ability of the model.
   - **Optimization of expression feature space**: The pre-trained ViT encoder is fine-tuned through contrastive learning. Specifically, a triplet loss pulls images with similar expressions closer together in the latent space and pushes images with dissimilar expressions farther apart.
2. **Expression-driven Multi-avatar Animator**:
   - **Extraction of expression features**: The expression foundation model extracts the expression representation from the source facial image.
   - **Dynamic identity injection**: Target avatars are assigned at random during training, and their identities are dynamically injected into the rig parameter decoder and the neural renderer.
   - **Rig parameter decoder**: Maps the expression semantics to the facial controllers of the 3D avatars. The generated rig parameters carry consistent expression information while retaining the facial properties specific to each avatar.
   - **Neural renderer**: Converts the rig parameters into facial images of the target avatar, so that expression supervision can be applied and high-frequency details captured during training.
   - **Training objective**: Combines perceptual loss, generative adversarial loss, cycle-consistency loss, and identity-conditional loss to train the model in a semi-supervised manner.
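
The triplet objective mentioned in the method overview can be illustrated with a minimal sketch. The summary only states that a triplet loss is used to refine the expression feature space; the margin value, the L2 normalization, the Euclidean distance, and the function names below are assumptions made for illustration, not details confirmed by the paper. Embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed preprocessing, not from the paper)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on normalized embeddings.

    The loss is zero once the anchor is closer to the positive
    (similar expression) than to the negative (dissimilar expression)
    by at least `margin`. The margin of 0.2 is a hypothetical value.
    """
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, euclidean(a, p) - euclidean(a, n) + margin)


# Toy 2D embeddings: the positive lies near the anchor, the negative far away.
anchor = [1.0, 0.0]
similar = [0.9, 0.1]
dissimilar = [0.0, 1.0]

well_ordered = triplet_loss(anchor, similar, dissimilar)
mis_ordered = triplet_loss(anchor, dissimilar, similar)
```

In this toy case the well-ordered triplet incurs no loss, while swapping the positive and negative yields a positive loss that a fine-tuning step would reduce by moving the embeddings.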

### Experimental results

The authors evaluate FreeAvatar on in-the-wild data through extensive experiments. The data comprises large-scale unlabeled facial images, a triplet dataset with expression-comparison annotations, and pairs of facial images with rig parameters. The results show that FreeAvatar achieves high-fidelity 3D facial animation transfer without introducing any geometric constraints.

### Conclusion

FreeAvatar overcomes the limitations of geometric constraints and classification-based expression features by learning a fine-grained, continuous expression representation, and thereby achieves robust 3D facial animation transfer. The method performs well on in-the-wild data and has broad application prospects.

FreeAvatar: Robust 3D Facial Animation Transfer by Learning an Expression Foundation Model
