Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi
2024-08-09
Abstract: We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for which we propose a new conditioning architecture to inject the dragging control effectively. More importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely adopted spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models. Unlike other motion-conditioned video generators that are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful motion trajectories. Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark. See our project page for more results: http://vgg-puppetmaster.github.io
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to build Puppet-Master, an interactive video generation model that can serve as a motion prior for part-level dynamics, that is, the motion of an object's parts rather than of the object as a whole. Given a single image and a sparse set of motion trajectories (drags), Puppet-Master synthesizes a video showing realistic part-level motion that follows those drags.

The authors pursue this goal in three ways:

1. **Model Improvement**: They fine-tune a large-scale pre-trained video diffusion model and propose a new conditioning architecture to inject the drag control effectively. They also introduce the all-to-first attention mechanism, a drop-in replacement for the widely used spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models.
2. **Dataset Construction**: They train on Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. Sub-optimal animations are filtered out automatically, and the synthetic renderings are augmented with meaningful motion trajectories.
3. **Experimental Validation**: They show that Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark.

In summary, the work addresses a limitation of existing motion-conditioned video generators, which are trained on in-the-wild videos and mostly move an entire object instead of its parts. Puppet-Master generates part-level motion that is faithful to the given drags and generalizes to real images across object categories.
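
The abstract names the all-to-first attention mechanism but does not spell out its computation. The sketch below illustrates one plausible reading, under the assumption that queries come from every frame while keys and values come only from the first frame, so each generated frame can look up appearance and background directly from the conditioning image. The function names, tensor layout, and the omission of learned projections and multiple heads are illustrative choices, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(q, k, v):
    """Standard per-frame spatial attention.

    q, k, v have shape (frames, tokens, channels); each frame attends
    only to its own spatial tokens.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax(q @ k.transpose(0, 2, 1) * scale, axis=-1)
    return weights @ v

def all_to_first_attention(q, k, v):
    """Assumed all-to-first variant (hypothetical helper).

    Queries from every frame attend to the keys and values of the
    first frame only; keys and values of later frames are discarded.
    """
    k_first = np.broadcast_to(k[:1], k.shape)
    v_first = np.broadcast_to(v[:1], v.shape)
    return spatial_self_attention(q, k_first, v_first)

# Toy example: 4 frames, 6 spatial tokens, 8 channels.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 6, 8))
k = rng.normal(size=(4, 6, 8))
v = rng.normal(size=(4, 6, 8))
out = all_to_first_attention(q, k, v)
print(out.shape)
```

Because the input and output shapes match those of ordinary spatial attention, the variant can be swapped in without changing the surrounding layers, which is consistent with the abstract's description of it as a drop-in replacement.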