T3M: Text Guided 3D Human Motion Synthesis from Speech

Wenshuo Peng, Kaipeng Zhang, Sai Qian Zhang
DOI: https://doi.org/10.18653/v1/2024.findings-naacl.74
2024-08-23
Abstract: Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed T3M. Unlike traditional approaches, T3M allows precise control over motion synthesis via textual input, enhancing the degree of diversity and user customization. Experimental results demonstrate that T3M greatly outperforms state-of-the-art methods in both quantitative metrics and qualitative evaluations. We have publicly released our code at https://github.com/Gloria2tt/T3M.git
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the limited accuracy and flexibility of generating 3D human motion from speech. Existing methods rely solely on speech audio to generate motion, which yields inaccurate and inflexible synthesis results. To solve this problem, the authors propose a text-guided 3D human motion synthesis method called T3M (Text Guided 3D Motion). Unlike traditional methods, T3M allows precise control over motion synthesis through text input, which increases diversity and the scope for user customization. The main contributions are:
1. A new speech-to-motion training framework, T3M, which gives better control over the overall motion generated from audio through text input.
2. A training scheme that aligns video and text in a joint embedding, uses video input during training, and uses text descriptions during inference, which significantly improves the diversity and performance of motion synthesis.
3. Experimental results showing that the T3M framework significantly outperforms existing methods in both quantitative and qualitative evaluations.
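The idea of training on video input and switching to text descriptions at inference depends on both modalities landing in the same embedding space, so the motion generator sees a conditioning vector of the same shape either way. The toy sketch below illustrates only that interface. It is not the authors' implementation: the encoders, the generator, every weight matrix, and all dimensions are hypothetical random stand-ins, and a real system would use a pretrained joint video-text encoder and a learned generator.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8  # hypothetical size of the shared video/text embedding space


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Hypothetical stand-ins for a pretrained joint video-text encoder pair.
# Both map their own modality into the same EMBED_DIM-dimensional space.
W_video = rng.normal(size=(16, EMBED_DIM))
W_text = rng.normal(size=(12, EMBED_DIM))


def encode_video(video_features):
    return l2_normalize(video_features @ W_video)


def encode_text(text_features):
    return l2_normalize(text_features @ W_text)


# Hypothetical motion generator: per-frame audio features plus one
# conditioning vector that is broadcast over all frames.
W_audio = rng.normal(size=(20, 6))
W_cond = rng.normal(size=(EMBED_DIM, 6))


def generate_motion(audio_features, condition):
    return audio_features @ W_audio + condition @ W_cond


audio = rng.normal(size=(30, 20))  # 30 frames of made-up audio features

# Training time: the condition comes from the video paired with the audio.
train_cond = encode_video(rng.normal(size=(16,)))
train_motion = generate_motion(audio, train_cond)

# Inference time: the condition comes from a text description instead.
infer_cond = encode_text(rng.normal(size=(12,)))
infer_motion = generate_motion(audio, infer_cond)
```

Because both conditions are unit vectors of the same dimension, the generator needs no change between training and inference. Whether text embeddings actually steer motion the way video embeddings do depends on how well the two modalities are aligned, which this sketch does not model.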