SpeechAct: Towards Generating Whole-body Motion from Speech

Jinsong Zhang, Minjie Zhu, Yuxiang Zhang, Yebin Liu, Kun Li
2024-06-03
Abstract: This paper addresses the problem of generating whole-body motion from speech. Despite great successes, prior methods still struggle to produce reasonable and diverse whole-body motions from speech. This is due to their reliance on suboptimal representations and a lack of strategies for generating diverse results. To address these challenges, we present a novel hybrid point representation that achieves accurate and continuous motion generation, e.g., avoiding foot skating. This representation can be transformed into an easy-to-use representation, i.e., the SMPL-X body mesh, for many applications. Facial motion is closely tied to the audio signal, so we introduce an encoder-decoder architecture that produces deterministic outcomes for the face. The body and hands have weaker connections to the audio signal, so for them we aim to generate diverse yet reasonable motions. To boost diversity in motion generation, we propose a contrastive motion learning method that encourages the model to produce more distinctive representations. Specifically, we design a robust VQ-VAE to learn a quantized motion codebook using our hybrid representation. Then, we regress the motion representation from the audio signal with a translation model that employs our contrastive motion learning method. Experimental results validate the superior performance and the correctness of our model. The project page is available for research purposes at http://cic.tju.edu.cn/faculty/likun/projects/SpeechAct.
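The abstract mentions a VQ-VAE that learns a quantized motion codebook. As a rough illustration of the quantization step only, the sketch below maps each latent vector to its nearest codebook entry by squared L2 distance, which is the standard VQ-VAE lookup. The function name `quantize`, the toy 2-D codebook, and the latent values are assumptions made for this example; the paper's actual encoder, codebook size, and training losses are not given in this text.

```python
import numpy as np

def quantize(latents, codebook):
    """Return (indices, quantized) for the nearest codebook entry of each latent.

    latents:  array of shape (T, D), one latent vector per motion frame or chunk
    codebook: array of shape (K, D), K learned code vectors
    (Both shapes are illustrative assumptions, not the paper's configuration.)
    """
    # Squared L2 distance between every latent and every code, shape (T, K),
    # expanded as |z|^2 - 2 z.c + |c|^2 to avoid an explicit loop.
    distances = (
        (latents ** 2).sum(axis=1, keepdims=True)
        - 2.0 * latents @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    indices = distances.argmin(axis=1)
    return indices, codebook[indices]

# Toy codebook with four 2-D codes (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
latents = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])

indices, quantized = quantize(latents, codebook)
print(indices)    # [1 2 0]
print(quantized)  # rows 1, 2 and 0 of the codebook
```

In a setup like the one the abstract describes, the discrete `indices` would be what a translation model predicts from audio, and a decoder would turn the selected codes back into motion.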
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of generating whole-body motion from speech, a task relevant to computer graphics and immersive virtual reality (VR/AR). Existing methods struggle to generate whole-body motions from speech that are both reasonable and diverse. Specifically, previous methods often generate motion for only part of the body and rely on keypoint representations. Keypoints are easy to learn and capture local details such as hand movements, but they lead to inaccurate and unrealistic results when used to fit or animate a complete 3D human model. These methods also tend to produce averaged motions that lack diversity. To address these issues, the paper proposes a method called SpeechAct, which improves the realism and diversity of generated motions through a hybrid point representation and contrastive motion learning. The hybrid point representation combines keypoints with surface points of the 3D human model, so it remains easy to learn while yielding smooth and reasonable motions. Contrastive motion learning trains the model to distinguish motions generated from different audio clips and different speakers, which improves the diversity of the generated results. Experimental results show that the model generates natural and diverse whole-body motions and also works with inputs in different languages and with music.
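To make the idea of contrastive motion learning more concrete, here is a minimal sketch of an InfoNCE-style loss, a common contrastive objective. It is a stand-in chosen for illustration, not the loss from the paper, whose exact formulation is not given in this summary. The function name `info_nce`, the temperature value, and the toy embeddings are all assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `anchors` should match row i of `positives`.

    anchors, positives: arrays of shape (N, D), e.g. audio-conditioned
    embeddings and the motion embeddings they should align with
    (an illustrative pairing, assumed for this example).
    """
    # Cosine similarity between every anchor and every candidate, shape (N, N).
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching pair sits on the diagonal; all other rows act as negatives.
    return float(-np.mean(np.diag(log_prob)))

# Three toy embeddings (hypothetical): aligned pairs versus mismatched pairs.
embeddings = np.eye(3)
loss_aligned = info_nce(embeddings, embeddings)
loss_shuffled = info_nce(embeddings, embeddings[[1, 2, 0]])
print(loss_aligned < loss_shuffled)  # True
```

The loss is low when each embedding is closest to its own partner and high when partners are mismatched, which is the kind of pressure that pushes representations from different audio clips or speakers apart.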