Abstract: Co-speech 3D gesture generation has seen tremendous progress in virtual avatar animation. Yet existing methods often produce stiff and unreasonable gestures on unseen human speech inputs because 3D speech-gesture data is limited. In this paper, we propose CoCoGesture, a novel framework enabling vivid and diverse gesture synthesis from unseen human speech prompts. Our key insight is built upon a custom-designed pretrain-finetune training paradigm. At the pretraining stage, we aim to formulate a large, generalizable gesture diffusion model by learning the manifold of abundant postures. To alleviate the scarcity of 3D data, we first construct a large-scale co-speech 3D gesture dataset, dubbed GES-X, containing more than 40M meshed posture instances across 4.3K speakers. Then, we scale up the large unconditional diffusion model to 1B parameters and pre-train it to be our gesture experts. At the finetune stage, we present the audio ControlNet, which incorporates the human voice as condition prompts to guide the gesture generation. Here, we construct the audio ControlNet through a trainable copy of our pre-trained diffusion model. Moreover, we design a novel Mixture-of-Gesture-Experts (MoGE) block to adaptively fuse the audio embedding from the human speech and the gesture features from the pre-trained gesture experts with a routing mechanism. This design ensures that the audio embedding is temporally coordinated with motion features while preserving vivid and diverse gesture generation. Extensive experiments demonstrate that our proposed CoCoGesture outperforms state-of-the-art methods on zero-shot speech-to-gesture generation. The dataset will be publicly available at: https://mattie-e.github.io/GES-X/
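The routing idea behind the MoGE block can be illustrated with a toy sketch. The abstract does not specify the router, the expert architecture, or the fusion rule, so everything below is an assumption made for illustration: the class name `ToyGestureExpertMixer`, the softmax gate over concatenated audio and gesture features, the linear "experts", and the residual addition are all hypothetical and should not be read as the paper's implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyGestureExpertMixer:
    """Hypothetical soft mixture-of-experts fusion (not the paper's MoGE block)."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Assumption: the router sees the audio and gesture features concatenated.
        self.router = random_matrix(num_experts, 2 * dim, rng)
        # Assumption: each "expert" is a single linear map over gesture features.
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def route(self, audio, gesture):
        """Return one gate weight per expert; the weights sum to 1."""
        return softmax(linear(self.router, audio + gesture))

    def fuse(self, audio, gesture):
        """Mix expert outputs by gate weight, then add them to the audio embedding."""
        gates = self.route(audio, gesture)
        outputs = [linear(expert, gesture) for expert in self.experts]
        mixed = [
            sum(g * out[i] for g, out in zip(gates, outputs))
            for i in range(self.dim)
        ]
        # Assumption: a residual sum is used as the fusion rule.
        return [a + m for a, m in zip(audio, mixed)]


mixer = ToyGestureExpertMixer(dim=4, num_experts=3)
audio_frame = [0.1, -0.2, 0.3, 0.0]
gesture_frame = [0.5, 0.5, -0.5, 0.2]
gates = mixer.route(audio_frame, gesture_frame)
fused = mixer.fuse(audio_frame, gesture_frame)
```

In this sketch the gate is computed per frame, which is one simple way to keep the audio and motion streams aligned in time; a real system would apply it per time step over learned feature sequences.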

Dual-Path Transformer-Based GAN for Co-speech Gesture Synthesis

Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control

CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild

Editable Co-Speech Gesture Synthesis Enhanced with Individual Representative Gestures

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents

C2G2: Controllable Co-speech Gesture Generation with Latent Diffusion Model

MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

Co-Speech Gesture Synthesis using Discrete Gesture Token Learning

Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference

Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings

AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis

UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons

Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis

Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

Salient Co-Speech Gesture Synthesizing with Discrete Motion Representation

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation