Abstract: People may perform diverse gestures, affected by various mental and physical factors, when speaking the same sentences. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, which easily results in plain, boring motions during inference. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to be responsible for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses and strategies, including a relaxed motion loss, a bicycle constraint, and a diversity loss, are designed to better train the variational autoencoder (VAE). Experiments on both 3D and 2D motion datasets verify that our method generates more realistic and diverse motions than previous state-of-the-art methods, both quantitatively and qualitatively. In addition, our formulation is compatible with discrete cosine transform (DCT) modeling and other popular backbones (i.e., RNN, Transformer). As for motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses/metrics (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline.
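To make the split latent code concrete, the following is a minimal sketch of how sampling could work at inference time. It is an illustration under stated assumptions, not the architecture used in this work: the mean-pooling `encode_audio`, the linear `decode`, the standard normal prior, and the names `AUDIO`, `WEIGHTS`, and `generate` are all hypothetical stand-ins. The only idea taken from the abstract is that the shared code is derived from the audio, while the motion-specific code is drawn independently of it, so a single audio clip can map to many motions.

```python
import random

# Hypothetical toy inputs: two audio frames with two features each, and a
# linear decoder mapping a 4-dim latent (2 shared + 2 specific) to 3 outputs.
AUDIO = [[0.1, 0.4], [0.3, 0.2]]
WEIGHTS = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.5, 0.5, 0.5, 0.5],
]

def encode_audio(audio_frames):
    """Toy audio encoder (assumption): average each feature over time."""
    n = len(audio_frames)
    dims = len(audio_frames[0])
    return [sum(frame[d] for frame in audio_frames) / n for d in range(dims)]

def sample_motion_specific(dim, rng):
    """Draw the motion-specific code from a standard normal prior (assumption)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def decode(shared, specific, weights):
    """Toy linear decoder applied to the concatenated latent code."""
    latent = shared + specific
    return [sum(w * z for w, z in zip(row, latent)) for row in weights]

def generate(audio_frames, weights, specific_dim, seed):
    """Generate one motion vector for the given audio and random seed."""
    rng = random.Random(seed)
    shared = encode_audio(audio_frames)                   # depends on audio only
    specific = sample_motion_specific(specific_dim, rng)  # independent of audio
    return decode(shared, specific, weights)

# Same audio, different motion-specific samples: the outputs differ.
motion_a = generate(AUDIO, WEIGHTS, specific_dim=2, seed=0)
motion_b = generate(AUDIO, WEIGHTS, specific_dim=2, seed=1)
```

In this sketch, `motion_a` and `motion_b` share the audio-driven component and differ only through the sampled motion-specific code. A deterministic one-to-one regressor, by contrast, would return a single output for a given audio clip.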
