Abstract: Producing emotionally dynamic 3D facial avatars from text derived from spoken words (Emo3D) has been a pivotal research topic in 3D avatar generation. While progress has been made in general-purpose 3D avatar generation, the generation of emotional 3D avatars remains underexplored, primarily due to the complexities of identifying and rendering rich emotions from spoken words. This paper reexamines Emo3D generation and, drawing inspiration from human processes, breaks Emo3D down into two cascading steps: Text-to-3D Expression Mapping (T3DEM) and 3D Avatar Rendering (3DAR). T3DEM is the most crucial step in determining the quality of Emo3D generation and encompasses three key challenges: Expression Diversity, Emotion-Content Consistency, and Expression Fluidity. To address these challenges, we introduce a novel benchmark to advance research in Emo3D generation. First, we present EmoAva, a large-scale, high-quality dataset for T3DEM, comprising 15,000 text-to-3D expression mappings that characterize the three challenges above. Furthermore, we develop various metrics to evaluate models against these challenges. Next, to model the consistency, diversity, and fluidity of human expressions in the T3DEM step, we propose the Continuous Text-to-Expression Generator, which employs an autoregressive Conditional Variational Autoencoder for expression code generation, enhanced with Latent Temporal Attention and Expression-wise Attention mechanisms. Finally, to improve the rendering of subtle expressions in the 3DAR step, we present the Globally-informed Gaussian Avatar (GiGA) model. GiGA incorporates a global information mechanism into 3D Gaussian representations, enabling the capture of subtle micro-expressions and seamless transitions between emotional states.
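
As a rough illustration of what "autoregressive Conditional Variational Autoencoder for expression code generation" can mean in practice, the toy sketch below samples a short sequence of expression codes from a text embedding. It is not the paper's Continuous Text-to-Expression Generator: the dimensions, the random stand-in weights, and the names `generate`, `linear`, and `make_weights` are all assumptions made for illustration, each step conditions only on the previous frame, and the Latent Temporal Attention and Expression-wise Attention mechanisms are not modeled.

```python
import math
import random

EXPR_DIM = 4    # size of one expression code (hypothetical)
LATENT_DIM = 2  # size of the latent variable z (hypothetical)


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def make_weights(rng, rows, cols, scale=0.5):
    """Random stand-in weights; a real model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]


def generate(text_emb, num_frames, seed):
    """Autoregressively sample a sequence of expression codes."""
    weight_rng = random.Random(0)     # fixed: plays the role of trained parameters
    sample_rng = random.Random(seed)  # varies: the source of expression diversity
    cond_dim = len(text_emb) + EXPR_DIM
    w_mu = make_weights(weight_rng, LATENT_DIM, cond_dim)
    w_logvar = make_weights(weight_rng, LATENT_DIM, cond_dim)
    w_dec = make_weights(weight_rng, EXPR_DIM, LATENT_DIM + cond_dim)

    prev = [0.0] * EXPR_DIM           # neutral start frame
    frames = []
    for _ in range(num_frames):
        cond = list(text_emb) + prev  # condition on the text and the previous frame
        mu = linear(w_mu, cond)
        logvar = linear(w_logvar, cond)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)
        z = [m + math.exp(0.5 * lv) * sample_rng.gauss(0.0, 1.0)
             for m, lv in zip(mu, logvar)]
        # Decode the next expression code from the latent and the condition
        prev = [math.tanh(x) for x in linear(w_dec, z + cond)]
        frames.append(prev)
    return frames
```

Calling `generate` twice with the same text embedding but different seeds yields different expression sequences, which is the property a conditional latent-variable model is typically used for when one text should map to many plausible expressions.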

Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark

ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion

EmoFace: Audio-driven Emotional 3D Face Animation

4DME: A Spontaneous 4D Micro-Expression Dataset with Multimodalities

EMOCA: Emotion Driven Monocular Face Capture and Animation

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

EmoGen: Quantifiable Emotion Generation and Analysis for Experimental Psychology

Real-time Facial Expression Recognition "In The Wild" by Disentangling 3D Expression from Identity

EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes

Generating Dataset For Large-scale 3D Facial Emotion Recognition

2D/3D Expression Generation Using Advanced Learning Techniques and the Emotion Wheel

MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation

EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

C3I-SynFace: A synthetic head pose and facial depth dataset using seed virtual human models

DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling

MIGMA: The Facial Emotion Image Dataset for Human Expression Recognition

M$^3$Face: A Unified Multi-Modal Multilingual Framework for Human Face Generation and Editing