Abstract: Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality, language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting an explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.

What problem does this paper attempt to address?

The paper addresses the challenge of generating language-guided human motion in 3D environments. Specifically, it focuses on two challenges:

1. **Description Fidelity and Physical Plausibility**: Generated motions must align with the language description, remain physically plausible within the 3D scene, and be placed at the location the description refers to. Existing conditional generative models (such as conditional variational autoencoders [cVAE] and conditional diffusion models) struggle to handle 3D scene grounding and conditional motion generation jointly, which makes it difficult for them to generalize across scenes and descriptions.
2. **High-Quality Data Requirement**: Generative models need a large amount of high-quality paired training data, but existing HSI (Human-Scene Interaction) datasets are limited in motion quality and diversity, and datasets that combine language, scene, and motion are scarce. The HUMANISE dataset attempts to fill this gap, but its limited motion types and fixed-form expressions restrict the generation of diverse HSI from varied, free-form language descriptions.

To address these issues, the authors propose a two-stage framework that uses a scene affordance map as an intermediate representation linking 3D scene grounding with conditional motion generation. The affordance map improves 3D scene grounding by explicitly marking the regions that correspond to the language description, which yields good results even with limited training data. Because the affordance map is distance-based, it also encodes the geometric relationship between the scene and the human motion, which aids HSI generation and improves generalization across scenes.

Experimental results show that the method outperforms the baselines on established benchmarks, including HumanML3D and HUMANISE, and generalizes well on a specially curated evaluation set containing previously unseen language descriptions and 3D scenes.
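To make the idea of a distance-based affordance map more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the map assigns each scene point a score per body joint that decays with the point's distance to that joint over the motion; the function name `distance_affordance_map`, the Gaussian falloff, and the `sigma` value are all hypothetical choices made for illustration.

```python
import numpy as np


def distance_affordance_map(scene_points, joints, sigma=0.8):
    """Illustrative distance-based affordance map (assumed formulation).

    scene_points: (N, 3) array of 3D scene point coordinates.
    joints:       (T, J, 3) array of joint positions over T frames.
    sigma:        hypothetical falloff scale, in the same units as the points.

    Returns an (N, J) array with values in (0, 1]; values near 1 mark
    scene points that a given joint comes close to during the motion.
    """
    # Pairwise offsets between every scene point and every joint per frame:
    # (1, N, 1, 3) - (T, 1, J, 3) -> (T, N, J, 3)
    diff = scene_points[None, :, None, :] - joints[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # (T, N, J)
    min_dist = dist.min(axis=0)            # closest approach over all frames
    # Assumed Gaussian falloff: distance 0 maps to 1, far points map to ~0.
    return np.exp(-0.5 * (min_dist / sigma) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.uniform(-2.0, 2.0, size=(1024, 3))  # toy scene point cloud
    motion = rng.uniform(-0.5, 0.5, size=(60, 22, 3))  # toy 60-frame motion
    affordance = distance_affordance_map(points, motion)
    print(affordance.shape)
```

In a two-stage setup like the one described above, a map of this kind would be the target the first stage predicts from the scene and the text, and the conditioning signal the second stage consumes when generating motion.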

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Generating Human Motion in 3D Scenes from Text Descriptions

LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment

CoMA: Compositional Human Motion Generation with Multi-modal Agents

Generating Continual Human Motion in Diverse 3D Scenes

Autonomous Character-Scene Interaction Synthesis from Text Instruction

Synthesizing Diverse Human Motions in 3D Indoor Scenes

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

MotionGPT: Human Motion Synthesis with Improved Diversity and Realism via GPT-3 Prompting

Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models

AMD: Autoregressive Motion Diffusion

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

Purposer: Putting Human Motion Generation in Context

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Generating Human Interaction Motions in Scenes with Text Control

Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling