Abstract: Producing emotionally dynamic 3D facial avatars from text derived from spoken words (Emo3D) has been a pivotal research topic in 3D avatar generation. While progress has been made in general-purpose 3D avatar generation, the generation of emotional 3D avatars remains underexplored, primarily because of the complexity of identifying and rendering rich emotions from spoken words. This paper reexamines Emo3D generation and, drawing inspiration from human processes, breaks it down into two cascading steps: Text-to-3D Expression Mapping (T3DEM) and 3D Avatar Rendering (3DAR). T3DEM is the most crucial step in determining the quality of Emo3D generation and encompasses three key challenges: Expression Diversity, Emotion-Content Consistency, and Expression Fluidity. To address these challenges, we introduce a novel benchmark to advance research in Emo3D generation. First, we present EmoAva, a large-scale, high-quality dataset for T3DEM, comprising 15,000 text-to-3D expression mappings that characterize the three challenges above. Furthermore, we develop various metrics to evaluate models against these challenges. Next, to model the consistency, diversity, and fluidity of human expressions in the T3DEM step, we propose the Continuous Text-to-Expression Generator, which employs an autoregressive Conditional Variational Autoencoder for expression code generation, enhanced with Latent Temporal Attention and Expression-wise Attention mechanisms. Finally, to improve the rendering of subtle expressions in the 3DAR step, we present the Globally-informed Gaussian Avatar (GiGA) model. GiGA incorporates a global information mechanism into 3D Gaussian representations, enabling the capture of subtle micro-expressions and seamless transitions between emotional states.

Real-time Conversion from a Single 2D Face Image to a 3D Text-Driven Emotive Audio-Visual Avatar

Real-time Synthesis of Chinese Visual Speech and Facial Expressions Using MPEG-4 FAP Features in a Three-Dimensional Avatar

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

EmoFace: Audio-driven Emotional 3D Face Animation

Text to Avatar in Multi-modal Human Computer Interface

Real-time Facial Animation with Image-Based Dynamic Avatars

Facial Expression Retargeting from Human to Avatar Made Easy

Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark

ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Audio-Driven Emotional 3D Talking-Head Generation

Democratizing the Creation of Animatable Facial Avatars

READ Avatars: Realistic Emotion-controllable Audio Driven Avatars

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

FreeAvatar: Robust 3D Facial Animation Transfer by Learning an Expression Foundation Model

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Universal Facial Encoding of Codec Avatars from VR Headsets

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Facial Expression Synthesis Based on Emotion Dimensions for Affective Talking Avatar