T3M: Text Guided 3D Human Motion Synthesis from Speech

Wenshuo Peng,Kaipeng Zhang,Sai Qian Zhang

DOI: https://doi.org/10.18653/v1/2024.findings-naacl.74

2024-08-23

Abstract: Speech-driven 3D motion synthesis seeks to create lifelike animations based on human speech, with potential uses in virtual reality, gaming, and film production. Existing approaches rely solely on speech audio for motion generation, leading to inaccurate and inflexible synthesis results. To mitigate this problem, we introduce a novel text-guided 3D human motion synthesis method, termed T3M. Unlike traditional approaches, T3M allows precise control over motion synthesis via textual input, enhancing the degree of diversity and user customization. The experimental results demonstrate that T3M can greatly outperform the state-of-the-art methods in both quantitative metrics and qualitative evaluations. We have publicly released our code at https://github.com/Gloria2tt/T3M.git

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the accuracy and flexibility of 3D human motion generated from speech. Existing methods rely solely on speech audio to generate motion, which leads to inaccurate and inflexible synthesis results. To solve this problem, the authors propose a text-guided 3D human motion synthesis method called T3M (Text Guided 3D Motion). Unlike traditional methods, T3M allows precise control over motion synthesis through text input, which increases diversity and the scope for user customization. The main contributions are:

1. A new speech-to-motion training framework, T3M, which uses text input to give better control over the overall motion generated from audio.
2. A training scheme that aligns video and text in a joint embedding, trains on video input, and uses text descriptions during inference, which significantly improves the diversity and performance of motion synthesis.
3. Experimental results showing that the T3M framework significantly outperforms existing methods in both quantitative and qualitative evaluations.
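The idea of training on video input and switching to text descriptions at inference can be illustrated with a toy sketch. It is not the authors' implementation: the embedding values, the `encode_video` and `encode_text` helpers, and the concatenation in `build_condition` are all hypothetical stand-ins chosen for illustration. The only point it makes is that, once the two modalities share an embedding space, the conditioning signal has the same shape and a similar direction whichever modality produced it.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy values standing in for a pretrained video-text model.
# In an aligned space, a clip and its description map to nearby points.
TOY_VIDEO_EMBEDDINGS = {"clip_waving": [0.9, 0.1, 0.0]}
TOY_TEXT_EMBEDDINGS = {"a person waving": [0.8, 0.2, 0.1]}


def encode_video(clip_id: str) -> Vector:
    """Hypothetical video encoder: look up and normalize a toy embedding."""
    return l2_normalize(TOY_VIDEO_EMBEDDINGS[clip_id])


def encode_text(prompt: str) -> Vector:
    """Hypothetical text encoder: look up and normalize a toy embedding."""
    return l2_normalize(TOY_TEXT_EMBEDDINGS[prompt])


def build_condition(audio_features: Vector, context: Vector) -> Vector:
    """Join audio features with a context embedding from the joint space.

    Concatenation is an assumption made for this sketch only.
    """
    return list(audio_features) + list(context)


audio = [0.3, 0.7]  # placeholder audio features

# Training time: the context comes from the video paired with the speech.
train_condition = build_condition(audio, encode_video("clip_waving"))

# Inference time: the context comes from a user-written text description.
infer_condition = build_condition(audio, encode_text("a person waving"))

similarity = cosine(encode_video("clip_waving"), encode_text("a person waving"))
```

Because both conditions have the same length, a generator trained with `train_condition` inputs could accept `infer_condition` inputs without any change to its interface; how well that transfers in practice depends on how tightly the real encoders are aligned.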

Generating Holistic 3D Human Motion from Speech

TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis

MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

TextIM: Part-aware Interactive Motion Synthesis from Text

KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Fg-T2M: Fine-Grained Text-Driven Human Motion Generation Via Diffusion Model

Contact-aware Human Motion Generation from Textual Descriptions

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

Text-driven 3D Avatar Animation with Emotional and Expressive Behaviors

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

3D Visible Speech Animation Driven by Prosody Text

MotionDiffuse: Text-Driven Human Motion Generation With Diffusion Model

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

"Mood Avatar: Automatic Text-Driven Head Motion Synthesis" International Conference on Multimodal Interfaces (ICMI2010)

FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Generating Human Interaction Motions in Scenes with Text Control

HumanTOMATO: Text-aligned Whole-body Motion Generation