Abstract: Generating vivid and diverse 3D co-speech gestures is crucial for animating virtual avatars across many applications. While most existing methods can generate gestures directly from audio, they usually overlook emotion, one of the key factors in authentic co-speech gesture generation. In this work, we propose EmotionGesture, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Because emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract emotion and audio beat features and to model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial-pose-based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP models the spatial-temporal correlations between the initial poses and the future gestures, producing spatially and temporally coherent pose prompts. Given the pose prompts, emotion features, and audio beat features, we generate 3D co-speech gestures through a transformer architecture. However, the poses in existing datasets often contain jitter, which would lead to unstable generated gestures. To address this issue, we propose an objective function dubbed Motion-Smooth Loss. Specifically, we model motion offsets to compensate for jittering ground truth by forcing gestures to be smooth. Finally, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state of the art, achieving vivid and diverse emotional co-speech 3D gestures. Our code and dataset will be released at the project page: https://xingqunqi-lab.github.io/Emotion-Gesture-Web/
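
The abstract does not give the formula for Motion-Smooth Loss, so the following is only a minimal sketch of the general idea of penalizing jitter through motion offsets, not the authors' objective. It assumes a pose sequence stored as a list of flattened per-frame joint coordinates; the names `motion_offsets` and `smoothness_penalty` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence

Pose = Sequence[float]  # flattened joint coordinates for one frame (assumed layout)


def motion_offsets(poses: Sequence[Pose]) -> List[List[float]]:
    """Frame-to-frame offsets: poses[t + 1] - poses[t] for each joint coordinate."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(poses, poses[1:])
    ]


def smoothness_penalty(poses: Sequence[Pose]) -> float:
    """Mean squared change between consecutive motion offsets.

    This is zero for constant-velocity motion and grows as the sequence
    jitters. It is an illustrative stand-in, not the paper's loss.
    """
    offsets = motion_offsets(poses)
    if len(offsets) < 2:
        return 0.0  # too few frames to measure a change in offset
    total = 0.0
    count = 0
    for prev, nxt in zip(offsets, offsets[1:]):
        for a, b in zip(prev, nxt):
            total += (b - a) ** 2
            count += 1
    return total / count if count else 0.0


# Two toy sequences with the same start and end pose.
smooth = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
jittery = [[0.0, 0.0], [1.5, 1.5], [1.5, 4.5], [3.0, 6.0]]

print(smoothness_penalty(smooth))
print(smoothness_penalty(jittery))
```

In this toy setup the jittery sequence receives a strictly larger penalty than the smooth one even though both reach the same final pose, which is the behaviour a smoothness term is meant to reward. How such a term is weighted against the reconstruction objective is not stated in the abstract.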

WeCard: a multimodal solution for making personalized electronic greeting cards.

AAML Based Avatar Animation with Personalized Expression for Online Chatting System

TellMeTalk: Multimodal-driven talking face video generation

Audio-driven Talking Face Video Generation with Natural Head Pose

Real-time Conversion from a Single 2D Face Image to a 3D Text-Driven Emotive Audio-Visual Avatar

Happy Companion: A System of Multimodal Human-Computer Affective Interaction

EmoFace: Audio-driven Emotional 3D Face Animation

FaceMe: an Augmented Reality Social Agent Game for Facilitating Children's Learning about Emotional Expressions

Multimodal emotion estimation and emotional synthesize for interaction virtual agent

Real-time Synthesis of Chinese Visual Speech and Facial Expressions Using MPEG-4 FAP Features in a Three-Dimensional Avatar

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Facial Expression Synthesis Based on Emotion Dimensions for Affective Talking Avatar

High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

TalkingAndroid: an Interactive, Multimodal and Real-Time Talking Avatar Application on Mobile Phones

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Text to Avatar in Multi-modal Human Computer Interface

Creative cartoon face synthesis system for mobile entertainment

Head and Facial Gestures Synthesis Using PAD Model for an Expressive Talking Avatar

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion

EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation