Abstract: Synthesizing interaction-involved human motions has been challenging due to the high complexity of 3D environments and the diversity of possible human behaviors within. We present LAMA, Locomotion-Action-MAnipulation, to synthesize natural and plausible long-term human movements in complex indoor environments. The key motivation of LAMA is to build a unified framework to encompass a series of everyday motions including locomotion, scene interaction, and object manipulation. Unlike existing methods that require motion data "paired" with scanned 3D scenes for supervision, we formulate the problem as a test-time optimization by using human motion capture data only for synthesis. LAMA leverages a reinforcement learning framework coupled with a motion matching algorithm for optimization, and further exploits a motion editing framework via manifold learning to cover possible variations in interaction and manipulation. Throughout extensive experiments, we demonstrate that LAMA outperforms previous approaches in synthesizing realistic motions in various challenging scenarios. Project page: https://jiyewise.github.io/projects/LAMA/

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper tackles the challenging problem of synthesizing natural and plausible long-duration human motions in complex 3D environments. Specifically:

1. **Limitations of Existing Methods**:
   - Current methods mostly focus on sub-problems, such as static pose modeling or interaction with a single target object.
   - Some recent methods attempt to synthesize dynamic interactive motions in real 3D scenes but require "paired" motion datasets (i.e., data capturing both the motion and the surrounding 3D environment simultaneously), which limits the complexity and diversity these methods can cover.
2. **Proposed New Method LAMA**:
   - LAMA (Locomotion-Action-Manipulation) is a unified framework that generates high-quality, realistic long-duration human motions, including walking, scene interaction, and object manipulation, within a given 3D scene.
   - Unlike existing methods, LAMA does not rely on motion datasets paired with 3D scenes. It treats synthesis as a test-time optimization problem that uses only human motion capture data.
   - LAMA combines a reinforcement learning framework with a motion matching algorithm to generate motions through optimization, and uses a motion editing framework based on manifold learning to cover possible variations in interaction and manipulation.
3. **Main Contributions**:
   - Proposes the first method capable of generating realistic long-duration motions, including walking, scene interaction, and object manipulation, in complex 3D scenes without requiring paired datasets.
   - Introduces a test-time optimization framework that requires only human motion capture data, combining reinforcement learning and motion matching, with a reward mechanism designed to avoid collisions and interact with the scene.
   - Achieves state-of-the-art motion synthesis quality with durations close to 10 seconds.
   - Captures and organizes a new high-quality motion capture dataset, including walking and actions (such as sitting down), suitable for motion matching.

In summary, the main goal of this paper is to generate natural and plausible long-duration human motions in complex 3D environments without the need for paired datasets.
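
The summary mentions motion matching without saying what it involves. As a rough illustration of the general technique only (not LAMA's actual implementation), motion matching can be viewed as a nearest-neighbor search: given a query feature vector, pick the motion capture frame whose features are closest. The function name `match_motion`, the three-dimensional feature vectors, and the toy database below are all hypothetical, and the reinforcement learning policy that would produce the query is not shown.

```python
import math

def match_motion(query, database):
    """Return the index of the database entry closest to `query`.

    `database` is a sequence of feature vectors, one per motion capture
    frame (hypothetical layout, chosen only for this sketch).
    """
    best_index, best_dist = -1, math.inf
    for i, features in enumerate(database):
        # Euclidean distance between the query and this frame's features.
        dist = math.dist(query, features)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

# Toy database: each row is an assumed feature vector for one frame.
database = [
    (0.0, 0.0, 1.0),
    (1.0, 0.0, 0.5),
    (0.0, 1.0, 0.2),
]

# A query that, in a full system, a learned policy might propose.
query = (0.9, 0.1, 0.5)
best = match_motion(query, database)
```

A linear scan keeps the sketch short; practical systems usually normalize the features and use an accelerated search structure over a much larger database.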

Locomotion-Action-Manipulation: Synthesizing Human-Scene Interactions in Complex 3D Environments

Synthesizing Diverse Human Motions in 3D Indoor Scenes

Object Motion Guided Human Motion Synthesis

Human-Object Interaction from Human-Level Instructions

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

SIMS: Simulating Human-Scene Interactions with Real World Script Planning

Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes

HYPERmotion: Learning Hybrid Behavior Planning for Autonomous Loco-manipulation

3D Human Motion Synthesis Based on Nonlinear Manifold Learning

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Controllable Human-Object Interaction Synthesis

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

Scene Synthesis from Human Motion

Human Motion Instruction Tuning

Generating Continual Human Motion in Diverse 3D Scenes

Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

CoMA: Compositional Human Motion Generation with Multi-modal Agents

The Wanderings of Odysseus in 3D Scenes

Synthesizing Moving People with 3D Control