Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer

Sigal Raab, Inbar Gat, Nathan Sala, Guy Tevet, Rotem Shalev-Arkushin, Ohad Fried, Amit H. Bermano, Daniel Cohen-Or
2024-06-11
Abstract: Given the remarkable results of motion synthesis with diffusion models, a natural question arises: how can we effectively leverage these models for motion editing? Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models, which enables manipulating the latent feature space; hence, they primarily center on handling the motion space. In this work, we explore the attention mechanism of pre-trained motion diffusion models. We uncover the roles and interactions of attention elements in capturing and representing intricate human motion patterns, and carefully integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer. Editing features associated with selected motions allows us to confront a challenge observed in prior motion diffusion approaches, which use general directives (e.g., text, music) for editing, ultimately failing to convey subtle nuances effectively. Our work is inspired by how a monkey closely imitates what it sees while maintaining its unique motion patterns; hence we call it Monkey See, Monkey Do, and dub it MoMo. Employing our technique enables accomplishing tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. Furthermore, diffusion inversion is seldom employed for motions; as a result, editing efforts focus on generated motions, limiting the editability of real ones. MoMo harnesses motion inversion, extending its application to both real and generated motions. Experimental results show the advantage of our approach over the current art. In particular, unlike methods tailored for specific applications through training, our approach is applied at inference time, requiring no training.
Our webpage is at https://monkeyseedocg.github.io.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses how to effectively use diffusion models for motion editing. Existing diffusion-based motion editing methods overlook the prior embedded in the weights of pre-trained models, a prior that makes it possible to manipulate the latent feature space; as a result, they mainly operate in the motion space. The paper explores the self-attention mechanism of pre-trained motion diffusion models, reveals the roles of attention elements in capturing and representing intricate human motion patterns, and integrates these elements to achieve zero-shot motion transfer: a leader motion is transferred to a follower motion while the follower's nuanced characteristics are preserved. By editing features associated with selected motions, the method tackles a limitation of earlier motion diffusion approaches, which edit through general directives (e.g., text, music) and fail to convey subtle nuances. Inspired by how a monkey imitates what it sees while keeping its own motion patterns, the authors name the technique "Monkey See, Monkey Do" (MoMo). It is applied at inference time with no additional training and supports tasks such as synthesizing out-of-distribution motions, style transfer, and spatial editing. MoMo also harnesses motion inversion, which extends its application to both real and generated motions. According to the authors, experimental results show an advantage over the current state of the art, in particular because, unlike methods trained for specific applications, MoMo requires no training.
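
The text above does not say which attention elements are combined or how, so the following is only an illustrative sketch of the general idea of mixing self-attention elements from two motions. It assumes, purely for illustration, that queries come from the leader motion and keys and values from the follower; the function name `mixed_attention`, the array shapes, and the random features are all hypothetical and are not taken from the MoMo implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixed_attention(q_leader, k_follower, v_follower):
    """Scaled dot-product attention with elements taken from two motions.

    ASSUMPTION (not stated in the source): queries are taken from the
    leader motion, keys and values from the follower motion.
    Shapes: q_leader (T_leader, d), k_follower and v_follower (T_follower, d).
    """
    d = q_leader.shape[-1]
    scores = q_leader @ k_follower.T / np.sqrt(d)  # (T_leader, T_follower)
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ v_follower, weights           # output is (T_leader, d)


# Toy per-frame features standing in for one attention layer's projections.
rng = np.random.default_rng(0)
frames_leader, frames_follower, dim = 6, 8, 4
q_leader = rng.standard_normal((frames_leader, dim))
k_follower = rng.standard_normal((frames_follower, dim))
v_follower = rng.standard_normal((frames_follower, dim))

out, weights = mixed_attention(q_leader, k_follower, v_follower)
```

In this sketch every output frame is a convex combination of follower value vectors, weighted by how well each leader query matches each follower key. That is one way to picture an output that follows the leader's frame-by-frame structure while being built from the follower's features.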