Abstract: While existing one-shot talking head generation models have achieved progress in coarse-grained emotion editing, there is still a lack of fine-grained emotion editing models with high interpretability. We argue that for an approach to be considered fine-grained, it needs to provide clear definitions and sufficiently detailed differentiation. We present LES-Talker, a novel one-shot talking head generation model with high interpretability, to achieve fine-grained emotion editing across emotion types, emotion levels, and facial units. We propose a Linear Emotion Space (LES) definition based on Facial Action Units to characterize emotion transformations as vector transformations. We design the Cross-Dimension Attention Net (CDAN) to deeply mine the correlation between the LES representation and the 3D model representation. By mining multiple relationships across different feature and structure dimensions, we enable the LES representation to guide the controllable deformation of the 3D model. To adapt multimodal data with deviations to the LES and to enhance visual quality, we use specialized network design and training strategies. Experiments show that our method provides high visual quality along with multilevel and interpretable fine-grained emotion editing, outperforming mainstream methods.

What problem does this paper attempt to address?

The paper addresses the shortcomings of existing one-shot talking head generation models in fine-grained emotion editing, which show up in two ways:

1. **Lack of interpretability**: Many existing methods achieve coarse-grained emotion editing but lack a clear definition of emotional change and a detailed way to distinguish between changes. For example, some studies use facial expression reference images without a clear emotional definition, others use discrete emotion labels, and still others extract emotional features from a latent space, which leaves the emotion transformation implicit.
2. **Limited fine-grained emotion editing**: Facial Action Units (AUs) can describe emotions effectively, but different studies use different AU combinations to represent the same emotion, so the definitions are inconsistent. In addition, AUs alone are not sufficient for fine-grained emotion editing. Some existing methods can only generate videos for specific emotions and are limited to coarse-grained editing.

To address these problems, the paper proposes the **Linear Emotion Space (LES)** and the **LES-Talker** model. LES is a linear emotion space defined on Facial Action Units (AUs); it characterizes emotion transformations explicitly and provides a highly interpretable theoretical basis. LES-Talker is a new one-shot talking head generation model that supports fine-grained emotion editing across emotion types, emotion levels, and facial units.

Specifically, the main contributions of LES-Talker include:

- Proposing the Linear Emotion Space (LES), which provides an interpretable theoretical basis for fine-grained emotion editing.
- Designing a new Cross-Dimension Attention Net (CDAN) to explore the correlations between 3D model representations and LES representations, thereby achieving controllable deformation of the 3D model.
- Showing experimentally that LES-Talker outperforms mainstream methods in visual quality and in multilevel fine-grained emotion editing.

Through these contributions, LES-Talker improves the quality of generated videos and provides fine-grained control over multiple emotion types, addressing the limitations of existing methods described above.
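The idea of treating an emotion edit as a vector transformation in an AU-based linear space can be illustrated with a small sketch. This is not the paper's implementation: the AU subset, the per-emotion direction vectors, and the function `edit_emotion` are all hypothetical, chosen only to show how one linear form could cover the three editing axes the summary mentions (emotion type, emotion level, and facial unit).

```python
# Hypothetical AU subset; the actual LES basis used by the paper is not given here.
AU_NAMES = ["AU1", "AU4", "AU6", "AU12", "AU15"]

# Illustrative (made-up) emotion directions in AU space, one weight per AU.
EMOTION_DIRECTIONS = {
    "happy": [0.0, 0.0, 1.0, 1.0, 0.0],
    "sad":   [1.0, 1.0, 0.0, 0.0, 1.0],
}


def edit_emotion(neutral, emotion, level, units=None):
    """Return neutral + level * direction, applied only to the selected AUs.

    emotion selects the direction (emotion type), level scales it
    (emotion level), and units restricts the edit to a subset of AUs
    (facial unit). With units=None, every AU is edited.
    """
    direction = EMOTION_DIRECTIONS[emotion]
    active = set(AU_NAMES if units is None else units)
    return [
        base + level * d if name in active else base
        for name, base, d in zip(AU_NAMES, neutral, direction)
    ]


neutral = [0.0] * len(AU_NAMES)

# Same emotion type at half intensity.
half_happy = edit_emotion(neutral, "happy", level=0.5)

# Full intensity, but restricted to a single facial unit.
mouth_only = edit_emotion(neutral, "happy", level=1.0, units=["AU12"])
```

Because the edit is linear, halving `level` halves every AU offset, and masking by `units` leaves all other AUs at their neutral values. In the paper, a representation of this kind would then guide the 3D model deformation through CDAN, which this sketch does not attempt to model.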

LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space

Continuously Controllable Facial Expression Editing in Talking Face Videos

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

A Continuous Emotional Editing Model for Talking Head Videos Based on Decoupling Texture and Geometry

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

Talking Face Generation With Audio-Deduced Emotional Landmarks

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Audio-driven Talking Face Video Generation with Natural Head Pose

Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN

GMTalker: Gaussian Mixture Based Emotional Talking Video Portraits

EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis