Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu, Wei Liang, Siyuan Huang
2024-03-27
Abstract: Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality, language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting an explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the challenge of generating language-guided human motion in 3D environments. It focuses on two challenges:

1. **Description Fidelity and Physical Plausibility**: Generated motions must align with the language description, remain physically plausible within the 3D scene, and be grounded at the location the description specifies. Existing conditional generative models (such as conditional variational autoencoders [cVAE] and conditional diffusion models) struggle to handle 3D scene grounding and conditional motion generation at the same time, which makes it hard for them to generalize across scenes and descriptions.
2. **High-Quality Data Requirement**: Generative models need large amounts of high-quality paired data for training, but existing human-scene interaction (HSI) datasets are limited in motion quality and diversity, and datasets that combine language, scene, and motion are scarce. Although the HUMANISE dataset attempts to fill this gap, its limited motion types and fixed-form expressions restrict the generation of diverse HSI from varied, free-form language descriptions.

To address these issues, the authors propose a two-stage framework that uses a scene affordance map as an intermediate representation linking 3D scene grounding with conditional motion generation. The affordance map strengthens 3D scene grounding by explicitly marking the regions that correspond to the language description, which yields good results even with limited training data. In addition, because the affordance map is distance-based, it encodes the geometric relationship between the scene and the human motion, which aids HSI generation and improves the model's generalization across scenes.
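To make the idea of a distance-based map between a scene and a motion concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' definition: the function name `distance_affordance_map`, the Gaussian falloff, the `sigma` parameter, and taking the minimum over frames are all hypothetical choices made for this example. It assigns each scene point a per-joint score that approaches 1 when the joint comes close to that point at some frame and decays toward 0 with distance.

```python
import numpy as np


def distance_affordance_map(scene_points, joints, sigma=0.5):
    """Toy distance-based map between a scene point cloud and a motion.

    scene_points: array of shape (N, 3), scene point coordinates.
    joints:       array of shape (T, J, 3), J joint positions over T frames.
    sigma:        assumed falloff scale (same unit as the coordinates).

    Returns an (N, J) array with values in (0, 1].
    NOTE: the Gaussian falloff and min-over-time are assumptions made
    for illustration; the summary above only says "distance-based".
    """
    scene_points = np.asarray(scene_points, dtype=float)
    joints = np.asarray(joints, dtype=float)

    # Broadcast to all (frame, point, joint) pairs: shape (T, N, J, 3).
    diff = scene_points[None, :, None, :] - joints[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (T, N, J)

    # Closest approach of each joint to each scene point over the motion.
    min_dist = dist.min(axis=0)  # (N, J)

    # Map distance to a bounded score: 1 at contact, decaying with distance.
    return np.exp(-0.5 * (min_dist / sigma) ** 2)
```

In a two-stage setup like the one described, a map of this kind would be the target of the first stage (predicted from the scene and the text) and a conditioning input for the second stage (motion generation); the sketch only covers how such a map could be computed from ground-truth data.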
Experimental results show that the method outperforms the baselines on existing benchmarks, including HumanML3D and HUMANISE, and generalizes well on a specially curated evaluation set containing previously unseen language descriptions and 3D scenes.