Abstract: We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrary long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.

What problem does this paper attempt to address?

The paper primarily focuses on addressing the problem of real-time 3D expression and gesture generation driven by speech. Specifically, its goal is to generate synchronized and natural expressions and gesture animations given speech input. Previous research typically handled the generation of expressions or gestures separately, whereas this paper proposes a joint generation method aimed at better simulating the intrinsic connection between expressions and gestures during human communication.

The proposed solution in the paper is named DiffSHEG (Diffusion-based Speech-driven Holistic 3D Expression and Gesture generation), which is a diffusion model-based approach capable of real-time speech-driven expression and gesture generation of any length. This method captures the joint distribution between expressions and gestures by introducing a unidirectional information flow design from expressions to gestures and utilizes a diffusion model to generate high-quality, diverse, and synchronized expressions and gestures. Additionally, the method introduces a sampling strategy called Fast Out-Painting to enhance the flexibility and computational efficiency of long-sequence generation.

The main contributions of the paper include:
1. Proposing a unified diffusion model framework, DiffSHEG, for joint 3D expression and gesture generation driven by speech.
2. Designing a unidirectional expression-gesture Transformer generator (UniEG) to enforce the information flow from expressions to gestures, thereby better capturing their joint distribution.
3. Introducing the Fast Out-Painting Partial Autoregressive Sampling (FOPPAS) method to support efficient real-time generation of sequences of any length.
4. Evaluating the method on two public datasets, demonstrating its superior performance in both quantitative and qualitative aspects.
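
To make the idea of a one-way information flow from expressions to gestures more concrete, here is a minimal sketch. It assumes the dependency is realised with an attention mask over a token sequence ordered as expression tokens followed by gesture tokens; the summary above does not say how UniEG actually implements it, so the mask, the function names `unidirectional_mask` and `masked_attention`, and the projection-free attention are all illustrative assumptions.

```python
import numpy as np

def unidirectional_mask(n_expr: int, n_gest: int) -> np.ndarray:
    """Boolean mask: entry [q, k] is True if query token q may attend to key token k.

    Assumed token order: [expression tokens..., gesture tokens...].
    """
    n = n_expr + n_gest
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_expr, :n_expr] = True  # expression attends to expression only
    mask[n_expr:, :] = True        # gesture attends to expression and gesture
    return mask

def masked_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # blocked pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))         # 2 expression tokens + 3 gesture tokens
mask = unidirectional_mask(n_expr=2, n_gest=3)
out = masked_attention(tokens, mask)
```

Under this mask, changing the gesture tokens leaves the expression outputs untouched, while changing the expression tokens does alter the gesture outputs, which is the asymmetry the summary describes.
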
In short, DiffSHEG aims to provide more realistic, synchronized, and diverse expression and gesture generation solutions for the development of digital humans and embodied agents through advanced diffusion model techniques.
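
The outpainting-based sampling mentioned above can be illustrated with a generic windowed sketch. This is not the authors' FOPPAS procedure, whose details are not given here; it only shows the general pattern of generating a long sequence clip by clip, where the first few frames of each new clip are pinned to (noised copies of) the last frames of the previous clip during denoising. The function `toy_denoise_step`, the linear noise schedule, and the parameters `window`, `overlap`, and `steps` are hypothetical placeholders.

```python
import numpy as np

def toy_denoise_step(x, t, cond):
    # Hypothetical stand-in for a trained, speech-conditioned denoiser:
    # it simply moves x toward the conditioning signal.
    return x + (cond - x) / (t + 1)

def sample_long_sequence(cond, window=8, overlap=2, steps=10, seed=0):
    """Generate a (frames, dim) sequence clip by clip, outpainting from the overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    rng = np.random.default_rng(seed)
    total, dim = cond.shape
    motion = np.zeros((total, dim))
    done = 0  # number of frames generated so far
    while done < total:
        known = min(overlap, done)        # context frames taken from the previous clip
        start = done - known
        end = min(start + window, total)
        x = rng.normal(size=(end - start, dim))  # start the clip from pure noise
        context = motion[start:done]
        for t in range(steps - 1, -1, -1):
            if known:
                # Overwrite the overlap with the known frames at the current noise level.
                noise_level = t / steps
                x[:known] = context + noise_level * rng.normal(size=context.shape)
            x = toy_denoise_step(x, t, cond[start:end])
        motion[done:end] = x[known:]      # keep only the newly generated frames
        done = end
    return motion

speech_features = np.linspace(0.0, 1.0, 23 * 3).reshape(23, 3)  # placeholder input
motion = sample_long_sequence(speech_features)
```

Because each clip only needs the previous clip's last frames, the sequence length is not tied to the training window, and the output can be produced incrementally as speech arrives. The toy denoiser reproduces its conditioning exactly, which is only useful as a sanity check that the stitching logic covers every frame.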

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Conversational Co-Speech Gesture Generation via Modeling Dialog Intention, Emotion, and Context with Diffusion Models

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

EMoG: Synthesizing Emotive Co-speech 3D Gesture with Diffusion Model

ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance

DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

DiffuGesture: Generating Human Gesture from Two-person Dialogue with Diffusion Models

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

Joint Co-Speech Gesture and Expressive Talking Face Generation using Diffusion with Adapters

SIGGesture: Generalized Co-Speech Gesture Synthesis via Semantic Injection with Large-Scale Pre-Training Diffusion Models

A Unified Editing Method for Co-Speech Gesture Generation via Diffusion Inversion