Abstract: Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. Existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from mode collapse and unstable training, making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations. Code is available at https://github.com/Advocate99/DiffGesture.
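
The diversity-versus-quality trade-off mentioned in the abstract can be illustrated with the generic classifier-free guidance rule, in which a conditional and an unconditional noise prediction are blended by a guidance scale. The abstract does not spell out how the "implicit" variant is realized, so the following is only a minimal sketch of the standard formulation and not DiffGesture's implementation; the function name `guided_noise`, the `scale` parameter, and the toy numbers are assumptions made for illustration.

```python
def guided_noise(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    Generic classifier-free guidance (assumed form, not taken from the paper):
        eps = eps_cond + scale * (eps_cond - eps_uncond)
    scale = 0 returns the conditional prediction unchanged; larger values
    move the result further away from the unconditional prediction.
    """
    return [c + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy predictions for a three-dimensional pose vector (made-up numbers).
eps_cond = [0.5, -1.0, 2.0]    # prediction given the audio condition
eps_uncond = [0.0, -1.0, 1.0]  # prediction with the condition dropped

print(guided_noise(eps_cond, eps_uncond, 0.0))  # [0.5, -1.0, 2.0]
print(guided_noise(eps_cond, eps_uncond, 1.0))  # [1.0, -1.0, 3.0]
```

In this generic form, a larger scale usually strengthens adherence to the condition at the cost of sample diversity, which is the trade-off the abstract refers to.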

Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model

EMoG: Synthesizing Emotive Co-speech 3D Gesture with Diffusion Model

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

A Unified Editing Method for Co-Speech Gesture Generation via Diffusion Inversion

SIGGesture: Generalized Co-Speech Gesture Synthesis via Semantic Injection with Large-Scale Pre-Training Diffusion Models

C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

DiffusionDialog: A Diffusion Model for Diverse Dialog Generation with Latent Space

MMoFusion: Multi-modal Co-Speech Motion Generation with Diffusion Model