SpeechAct: Towards Generating Whole-body Motion from Speech

Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Yebin Liu, Kun Li

2024-06-03

Abstract: This paper addresses the problem of generating whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motions from speech. This is due to their reliance on suboptimal representations and a lack of strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation to achieve accurate and continuous motion generation, e.g., avoiding foot skating, and this representation can be transformed into an easy-to-use representation, i.e., SMPL-X body mesh, for many applications. To generate whole-body motion from speech, for facial motion, closely tied to the audio signal, we introduce an encoder-decoder architecture to achieve deterministic outcomes. However, for the body and hands, which have weaker connections to the audio signal, we aim to generate diverse yet reasonable motions. To boost diversity in motion generation, we propose a contrastive motion learning method to encourage the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE to learn a quantized motion codebook using our hybrid representation. Then, we regress the motion representation from the audio signal by a translation model employing our contrastive motion learning method. Experimental results validate the superior performance and the correctness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.
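The abstract mentions a VQ-VAE that learns a quantized motion codebook. As a rough illustration of the quantization step only, the sketch below maps each latent vector to its nearest codebook entry. The function names, the 2-D toy codebook, and the squared Euclidean distance are assumptions made for this example; they are not taken from the paper, which does not specify these details here.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each latent vector with its nearest codebook entry.

    Returns the chosen codebook indices and the quantized vectors.
    This is generic vector quantization (hypothetical helper), not the
    authors' implementation.
    """
    indices = []
    for z in latents:
        distances = [squared_distance(z, c) for c in codebook]
        indices.append(distances.index(min(distances)))
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Toy data (assumed for illustration): three codes, three latent vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, quantized = quantize(latents, codebook)
print(indices)    # [0, 1, 2]
print(quantized)  # the matching codebook rows
```

In a full VQ-VAE the codebook is learned jointly with the encoder and decoder; this sketch covers only the lookup that turns continuous features into discrete code indices.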

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of generating whole-body motion from speech, a task relevant to computer graphics and immersive VR/AR applications. Existing methods struggle to produce whole-body motions that are both reasonable and diverse. Many of them generate motion for only part of the body and rely on keypoint representations. Keypoints are easy to learn and capture local details such as hand movements, but they lead to inaccurate and unrealistic results when they are fitted to, or used to animate, a complete 3D human model. These methods also tend to produce averaged motions that lack diversity.

To address these issues, the paper proposes SpeechAct, which improves the realism and diversity of generated motion through a hybrid point representation and contrastive motion learning. The hybrid point representation combines keypoints with surface points of the 3D human model, so it remains easy to learn while yielding smooth and reasonable motion. Contrastive motion learning trains the model to distinguish motions generated from different audio clips and different speakers, which increases the diversity of the results. According to the reported experiments, the model generates natural and diverse whole-body motion and also works with speech in different languages and with music inputs.
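The idea of pulling matching audio-motion representations together while pushing apart those from different audio or speakers can be illustrated with a generic InfoNCE-style contrastive loss. The sketch below is an assumption-laden example: the function names, the cosine similarity, the temperature value, and the toy vectors are all chosen for illustration, and the summary above does not state the exact loss used in SpeechAct.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not from the paper
) -> float:
    """InfoNCE-style loss: low when the anchor is closer to the positive
    than to every negative, high otherwise."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(
        math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives
    )
    return -math.log(pos / (pos + neg))


# Toy embeddings (assumed): the anchor stands for an audio feature, the
# positive for its paired motion feature, the negatives for motion features
# from other audio clips or speakers.
anchor = [1.0, 0.0]
matched = [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.0]]

good_loss = info_nce(anchor, matched, others)
bad_loss = info_nce(anchor, [0.0, 1.0], [matched, [-1.0, 0.0]])
print(good_loss < bad_loss)  # True
```

Minimizing a loss of this kind encourages representations from different inputs to stay distinct, which is one common way to counteract averaged outputs.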

