Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

2024-08-09

Abstract: We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for which we propose a new conditioning architecture to inject the dragging control effectively. More importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely adopted spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models. Unlike other motion-conditioned video generators that are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful motion trajectories. Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark. See our project page for more results: http://vgg-puppetmaster.github.io.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to build Puppet-Master, an interactive video generation model that serves as a motion prior for part-level dynamics, that is, the motion of an object's parts rather than of the object as a whole. Given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master synthesizes a video showing realistic part-level motion that follows those drags. The work rests on three components:

1. **Model Improvement**: A large-scale pre-trained video diffusion model is fine-tuned with a new conditioning architecture that injects the drag control. In addition, an "all-to-first" attention mechanism replaces the usual spatial attention modules and significantly improves generation quality by addressing the appearance and background issues of existing models.
2. **Dataset Construction**: The model is trained on Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. Sub-optimal animations are filtered out automatically, and the synthetic renderings are augmented with meaningful motion trajectories.
3. **Experimental Validation**: Puppet-Master generalizes to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark.

In summary, the research addresses a limitation of existing motion-conditioned video generators, which are trained on in-the-wild videos and mostly move an entire object instead of capturing part-level dynamics. Puppet-Master generates realistic part-level motion and generalizes well to different object categories and to real images.
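The abstract describes all-to-first attention only as a drop-in replacement for spatial attention, so the following is a minimal sketch under an assumption: that "all-to-first" means the queries of every frame attend to the keys and values of the first frame only, instead of to those of their own frame. The function names, the nested-list tensor layout `[frames][tokens][channels]`, and the omission of learned projections, heads, and batching are all illustrative choices, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def per_frame_attention(q, k, v):
    """Standard spatial attention: each frame attends to its own tokens."""
    return [attend(qf, kf, vf) for qf, kf, vf in zip(q, k, v)]


def all_to_first_attention(q, k, v):
    """Assumed all-to-first variant: every frame's queries attend to the
    keys and values of frame 0; keys/values of later frames are unused."""
    return [attend(q_frame, k[0], v[0]) for q_frame in q]
```

Under this reading, the output for the first frame is identical to ordinary spatial attention, while every later frame reads appearance information directly from the first (conditioning) frame. Because the input and output shapes match those of `per_frame_attention`, the swap is drop-in in the sense the abstract describes.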

Motion Prompting: Controlling Video Generation with Motion Trajectories

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Controllable Longer Image Animation with Diffusion Models

DragAPart: Learning a Part-Level Motion Prior for Articulated Objects

DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors

Image Comes Dancing With Collaborative Parsing-Flow Video Synthesis

MotionCrafter: One-Shot Motion Customization of Diffusion Models

3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

AnimateAnything: Consistent and Controllable Animation for Video Generation

MotionCraft: Physics-based Zero-Shot Video Generation

Motion Control for Enhanced Complex Action Video Generation

Purposer: Putting Human Motion Generation in Context

Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models

Generative Tweening: Long-term Inbetweening of 3D Human Motions

GANimator: Neural Motion Synthesis from a Single Sequence

Interaction Mix and Match: Synthesizing Close Interaction using Conditional Hierarchical GAN with Multi-Hot Class Embedding

Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation

DragVideo: Interactive Drag-style Video Editing