Abstract: Given the remarkable results of motion synthesis with diffusion models, a natural question arises: how can we effectively leverage these models for motion editing? Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models, which enables manipulating the latent feature space; hence, they primarily center on handling the motion space. In this work, we explore the attention mechanism of pre-trained motion diffusion models. We uncover the roles and interactions of attention elements in capturing and representing intricate human motion patterns, and carefully integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer. Editing features associated with selected motions allows us to confront a challenge observed in prior motion diffusion approaches, which use general directives (e.g., text, music) for editing, ultimately failing to convey subtle nuances effectively. Our work is inspired by how a monkey closely imitates what it sees while maintaining its unique motion patterns; hence we call it Monkey See, Monkey Do, and dub it MoMo. Employing our technique enables accomplishing tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. Furthermore, diffusion inversion is seldom employed for motions; as a result, editing efforts focus on generated motions, limiting the editability of real ones. MoMo harnesses motion inversion, extending its application to both real and generated motions. Experimental results show the advantage of our approach over the current art. In particular, unlike methods tailored for specific applications through training, our approach is applied at inference time, requiring no training.
Our webpage is at https://monkeyseedocg.github.io.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses how to effectively leverage diffusion models for motion editing. Existing diffusion-based motion editing methods overlook the prior embedded in the weights of pre-trained models, a prior that makes it possible to manipulate the latent feature space; as a result, they operate mainly in the motion space. The paper explores the self-attention mechanism of pre-trained motion diffusion models, uncovers the roles that attention elements play in capturing and representing intricate human motion patterns, and integrates these elements to transfer a leader motion to a follower motion while preserving the follower's nuanced characteristics, yielding zero-shot motion transfer. Editing features associated with selected motions addresses a limitation of prior motion diffusion approaches, which rely on general directives (e.g., text, music) for editing and fail to convey subtle nuances. Inspired by how a monkey imitates what it sees while keeping its own motion patterns, the authors name the technique "Monkey See, Monkey Do" (abbreviated as MoMo). The technique is applied at inference time and requires no training; it supports tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. MoMo also uses motion inversion, which extends its application to both real and generated motions. Experimental results show an advantage over the current state of the art, in particular over methods that are trained for specific applications.
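As a rough illustration of what integrating attention elements across two motions could look like, the sketch below computes scaled dot-product attention with queries taken from the leader and keys and values taken from the follower. This split is an assumption made for illustration only: the abstract does not say which elements come from which motion. The function names (`attention`, `mixed_attention`) and the toy per-frame feature vectors are hypothetical and are not taken from MoMo's code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of per-frame feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query frame to every key frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def mixed_attention(leader_q, follower_k, follower_v):
    """Assumed mixing rule: leader queries attend to follower keys and values."""
    return attention(leader_q, follower_k, follower_v)


# Toy features: 3 leader frames, 2 follower frames, 2 feature dimensions.
leader_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
follower_k = [[1.0, 0.0], [0.0, 1.0]]
follower_v = [[10.0, 0.0], [0.0, 10.0]]

out = mixed_attention(leader_q, follower_k, follower_v)
for frame in out:
    print([round(x, 3) for x in frame])
```

Under this assumed rule, the output has one frame per leader frame, and each output frame is a convex combination of follower value vectors, weighted by how well the leader query matches each follower key. A real motion diffusion model would apply this inside its attention layers during denoising, which the sketch does not attempt.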

Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer
