Audio-Driven Co-Speech Gesture Video Generation

Xian Liu, Qianyi Wu, Hang Zhou, Yuanqi Du, Wayne Wu, Dahua Lin, Ziwei Liu
DOI: https://doi.org/10.48550/arXiv.2212.02350
2022-12-05
Computer Vision and Pattern Recognition
Abstract: Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequence driven by speech audio. Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from implicit motion representation to codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture video. Demo video and more resources can be found in: https://alvinliu0.github.io/projects/ANGIE
What problem does this paper attempt to address?
This paper addresses how to generate a co-speech gesture video that is synchronized with a given speech audio clip. The authors generate the speaker's gesture video directly in the image domain, rather than producing only human skeleton data as previous works do. The challenge is to build a unified, audio-driven framework whose output image sequence is high-fidelity and reflects the speaker's co-speech gestures naturally.

The main contributions of the paper can be summarized as follows:

1. **Explored the problem of audio-driven co-speech gesture video generation**: The authors describe this as the first work to generate co-speech gesture videos in the image domain with a unified framework that uses no structural human body prior (such as 2D or 3D skeletons).
2. **Proposed the VQ-Motion Extractor (Vector Quantized Motion Extractor)**: It quantizes the implicit motion representation into a codebook, so that each codebook entry summarizes a commonly used, reusable gesture pattern.
3. **Designed the Co-Speech GPT (a co-speech gesture GPT with motion refinement)**: It predicts the discrete gesture patterns from speech audio, and a motion refinement network then adds the subtle rhythmic details needed for fine-grained results.

With these components, the ANGIE framework generates realistic and vivid co-speech gesture videos from speech audio and an initial image. The work targets applications in human-machine interaction and digital entertainment.
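The codebook quantization behind the VQ-Motion Extractor can be illustrated with a toy nearest-neighbour lookup. This is a minimal sketch of generic vector quantization, not the authors' implementation: the names `quantize` and `dequantize`, the 2-D feature vectors, and the hand-written three-entry codebook are all assumptions made for illustration. In the paper the codebook is learned and operates on an unsupervised motion representation.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    indices = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def dequantize(indices: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each index with a copy of the codebook entry it refers to."""
    return [list(codebook[i]) for i in indices]


# Hypothetical codebook: three "gesture patterns" in a 2-D feature space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical per-frame motion features (continuous values).
motion_features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1]]

tokens = quantize(motion_features, codebook)
reconstructed = dequantize(tokens, codebook)

print(tokens)         # [0, 1, 2, 1]
print(reconstructed)  # [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

The discrete `tokens` sequence is the kind of output a sequence model such as the Co-Speech GPT could be trained to predict from audio. The difference between `motion_features` and `reconstructed` is the fine detail that quantization discards, which is the gap the summary says the motion refinement network fills.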