Abstract: Human motion generation is a long-standing problem, and scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, whose quantity is limited. Meanwhile, it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion is derived through iterative diffusion denoising and implicit policy optimization, so motion naturalness and interaction plausibility can be maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN-inversion manner to maintain motion continuity, and controls keyframe poses through the ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between multiple sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and on real scenes from PROX and Replica. Results show that our framework achieves better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of using DIP for motion synthesis in more general tasks and versatile scenes. Project page: https://jingyugong.github.io/DiffusionImplicitPolicy/
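
The motion blending mentioned in the abstract (fusing rotations in "power space" and translations in linear space) can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes one common reading: the blended rotation is `R_a (R_a^T R_b)^w` and the blended translation is `(1 - w) t_a + w t_b`, for a blend weight `w` in `[0, 1]`. All function names are hypothetical, and the fractional matrix power is computed through an axis-angle conversion that assumes the relative rotation is below 180 degrees.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose of a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula; `axis` must be a unit vector."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]

def rotation_power(r, w):
    """Fractional power r**w of a rotation matrix: same axis, angle scaled by w.

    Assumption: the rotation angle of `r` is strictly below pi.
    """
    cos_theta = max(-1.0, min(1.0, (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        return [row[:] for row in IDENTITY]
    s = 2.0 * math.sin(theta)
    axis = ((r[2][1] - r[1][2]) / s,
            (r[0][2] - r[2][0]) / s,
            (r[1][0] - r[0][1]) / s)
    return axis_angle_to_matrix(axis, w * theta)

def blend_pose(rot_a, trans_a, rot_b, trans_b, w):
    """Blend two rigid poses: rotation in power space, translation linearly."""
    relative = matmul(transpose(rot_a), rot_b)
    rot = matmul(rot_a, rotation_power(relative, w))
    trans = [(1.0 - w) * a + w * b for a, b in zip(trans_a, trans_b)]
    return rot, trans

# Usage: halfway between the identity and a 90-degree turn about the z axis.
rz90 = axis_angle_to_matrix((0.0, 0.0, 1.0), math.pi / 2)
rot_mid, trans_mid = blend_pose(IDENTITY, [0.0, 0.0, 0.0],
                                rz90, [2.0, 0.0, 0.0], 0.5)
```

In a transition between two sub-tasks, `w` would typically ramp from 0 to 1 over the overlapping frames; the schedule used by the authors is not stated in the abstract, so any particular ramp is an assumption.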

What problem does this paper attempt to address?

The core problem this paper addresses is **scene-aware motion synthesis without paired motion-scene data**. Existing scene-aware motion synthesis methods rely heavily on limited paired motion-scene data, which restricts their diversity and generalization ability. To address this, the authors propose a new framework, Diffusion Implicit Policy (DIP), for unpaired scene-aware motion synthesis.

### Main problems and challenges

1. **Limitations of paired data**: Most existing methods rely on paired motion-scene data. Such data is limited in quantity, so it is difficult to cover diverse scenes during training.
2. **Poor generalization ability**: When trained on only a few specific scenes, existing methods have difficulty generalizing to more diverse ones.
3. **Trade-off between motion naturalness and interaction plausibility**: Many methods sacrifice motion naturalness to improve interaction plausibility, or vice versa.

### Proposed solutions

To meet these challenges, the authors propose the following:

- **Decouple human-scene interaction from motion synthesis**: Human-scene interaction is separated from motion synthesis during training, so paired motion-scene data is no longer required.
- **Introduce implicit policy optimization**: An interaction-based implicit policy is introduced at inference time. Iterative diffusion denoising combined with implicit policy optimization maintains both motion naturalness and interaction plausibility.
- **Long-term motion synthesis**: For long-term motion synthesis involving multiple sub-tasks, a time-varying motion blending method ensures continuity between historical and future motion.

### Main contributions

1. **A new framework, DIP**: The framework casts scene-aware motion synthesis as a joint optimization problem that maintains motion naturalness and interaction plausibility together.
2. **Adjustment of the sampling distribution**: During denoising, the center of the sampling distribution is adjusted in a GAN-inversion manner to improve interaction plausibility.
3. **New motion generated under historical constraints**: Long-term motion is synthesized through interpolation and motion blending in the power space of rotation matrices, ensuring smooth transitions between consecutive sub-tasks.

In conclusion, this paper removes the dependence of existing scene-aware motion synthesis methods on paired data by proposing the DIP framework, improving both the generalization ability of the model and the quality of the generated motion.
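
The inference procedure summarized above (alternating diffusion denoising with optimization of the intermediate motion against an interaction objective) can be sketched on a toy problem. This is not the authors' model: the "denoiser" below is a hypothetical stand-in that simply smooths a 1D trajectory, the interaction cost (reach a goal at the last frame, stay above a floor) is invented for illustration, no noise is re-injected between steps, and every name and parameter is an assumption.

```python
import random

def toy_denoise(motion, strength=0.5):
    """Stand-in for a trained motion denoiser: pull each frame toward the
    average of its neighbours (boundary frames reuse themselves)."""
    n = len(motion)
    out = []
    for i, x in enumerate(motion):
        left = motion[max(i - 1, 0)]
        right = motion[min(i + 1, n - 1)]
        out.append(x + strength * (0.5 * (left + right) - x))
    return out

def interaction_cost(motion, goal=1.0, floor=0.0):
    """Toy interaction cost: squared distance of the last frame to the goal
    plus squared penetration of any frame below the floor."""
    reach = (motion[-1] - goal) ** 2
    penetration = sum((floor - x) ** 2 for x in motion if x < floor)
    return reach + penetration

def interaction_grad(motion, goal=1.0, floor=0.0):
    """Analytic gradient of `interaction_cost` with respect to each frame."""
    grad = [0.0] * len(motion)
    grad[-1] += 2.0 * (motion[-1] - goal)
    for i, x in enumerate(motion):
        if x < floor:
            grad[i] += 2.0 * (x - floor)
    return grad

def synthesize(init, steps=50, inner_steps=3, lr=0.1):
    """Alternate one denoising step with a few gradient steps that optimize
    the intermediate sample against the interaction cost."""
    motion = list(init)
    for _ in range(steps):
        motion = toy_denoise(motion)          # keeps the motion smooth
        for _ in range(inner_steps):          # steers it toward the scene goal
            grad = interaction_grad(motion)
            motion = [x - lr * g for x, g in zip(motion, grad)]
    return motion

# Usage: start from seeded Gaussian noise, with and without guidance.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]
guided = synthesize(noise, inner_steps=3)
unguided = synthesize(noise, inner_steps=0)
```

The design point the sketch tries to convey is that the interaction objective acts on the intermediate sample inside the denoising loop rather than on the final output, so the denoiser keeps regularizing the motion while the optimization steers it.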

Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis
