Audio-Driven Co-Speech Gesture Video Generation

Xian Liu, Qianyi Wu, Hang Zhou, Yuanqi Du, Wayne Wu, Dahua Lin, Ziwei Liu

DOI: https://doi.org/10.48550/arXiv.2212.02350

2022-12-05

Computer Vision and Pattern Recognition

Abstract: Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequence driven by speech audio. Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representation to codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE

What problem does this paper attempt to address?

The paper addresses how to generate a co-speech gesture video that is synchronized with a given speech audio clip. Specifically, the authors generate the speaker's gesture video sequence directly in the image domain, rather than producing only human skeleton data as in previous works. The challenge is to build a unified framework, driven by speech audio, whose generated image sequence is both high-fidelity and natural in its co-speech gestures.

The main contributions of the paper can be summarized as follows:

1. **Explored the problem of audio-driven co-speech gesture video generation**: This is the first work to generate co-speech gesture videos in the image domain using a unified framework without structural human body priors (such as 2D or 3D skeletons).
2. **Proposed the VQ-Motion Extractor (vector quantized motion extractor)**: It quantizes the implicit motion representation into a codebook, so that commonly used, reusable gesture patterns are captured as discrete codebook entries.
3. **Designed the Co-Speech GPT (co-speech gesture GPT with motion refinement)**: It predicts discrete gesture patterns from speech audio and supplements subtle rhythmic details through a motion refinement network to achieve fine-grained results.

With these components, the proposed framework, ANGIE, generates realistic and vivid co-speech gesture videos from speech audio and an initial image. The work is relevant to human-computer interaction and digital entertainment, and it offers a basis for future research in related fields.
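To make the idea of quantizing motion into a codebook more concrete, here is a minimal sketch of nearest-neighbor vector quantization in plain Python. It is not the authors' implementation: the function names, the 2D toy vectors, and the hand-written codebook are all hypothetical, and in a real system the codebook would be learned from data rather than fixed. The residual computed at the end is only an illustration of the detail that quantization discards, which is the kind of information a refinement stage would need to supply.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[List[float]], codebook: List[List[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def dequantize(indices: List[int], codebook: List[List[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete index."""
    return [codebook[i] for i in indices]


# Hypothetical toy data: three "motion pattern" prototypes and three
# per-frame motion features (assumed values, for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

codes = quantize(features, codebook)          # discrete pattern indices
quantized = dequantize(codes, codebook)       # coarse, reusable patterns

# What quantization throws away: fine detail per frame.
residuals = [
    [x - q for x, q in zip(f, c)] for f, c in zip(features, quantized)
]

print(codes)
```

In this toy setup, a sequence model only has to predict a short list of integer codes from audio, while the continuous residual stands in for the subtle motion that the discrete codes cannot express.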

Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

Text-driven Visual Prosody Generation for Embodied Conversational Agents

EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

Audio-driven Neural Gesture Reenactment with Video Motion Graphs

Gesticulator: A framework for semantically-aware speech-driven gesture generation

Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis

Audio2Gestures: Generating Diverse Gestures From Audio

QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation

Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

Salient Co-Speech Gesture Synthesizing with Discrete Motion Representation

Audio-driven Talking Face Video Generation with Natural Head Pose

Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings

Realistic Speech-Driven Talking Video Generation with Personalized Pose

Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference

CoCoGesture: Toward Coherent Co-speech 3D Gesture Generation in the Wild