Video Creation by Demonstration

Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu
2024-12-13
Abstract: We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential towards interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The core problem this paper addresses is: **given a demonstration video and a context image from a different scene, how to generate a physically plausible video that continues naturally from the context image and carries out the action concepts shown in the demonstration**.

Specifically, the paper proposes a new video creation experience, **Video Creation by Demonstration**. A user provides a demonstration video showing the desired action concept and a context image of the initial scene. The generated video applies the action concepts from the demonstration to the new scene while remaining temporally and physically coherent.

To achieve this, the paper introduces $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos through conditional future frame prediction. Unlike most existing video generation controls, which rely on explicit signals, $\delta$-Diffusion uses implicit latent control to provide the flexibility and expressiveness that general videos require. An appearance bottleneck placed on top of a video foundation model extracts action latents from the demonstration video, so that the generation process is conditioned on them with minimal appearance leakage.

### The main contributions of the paper include:

1. **Introducing Video Creation by Demonstration**: a new controllable video generation experience in which a video serves directly as the driving control signal that conveys action concepts.
2. **Utilizing off-the-shelf video foundation models for the first time** for latent control of video generation.
3. **Proposing a new self-supervised training paradigm** that outperforms related baselines in both human preference and large-scale machine evaluations.

With these designs, the paper addresses the difficulty existing methods have in transferring action concepts and generating videos in complex scenes, and it points toward interactive world simulation.
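
The self-supervised setup described above can be made concrete with a toy sketch. This is not the authors' implementation. It only illustrates, under stated assumptions, how one unlabeled clip could yield a context image, a demonstration, and a prediction target. The function names `make_training_example` and `extract_action_latents` are hypothetical, the assumption that the clip itself serves as the demonstration during training is a reading of the abstract, and the random low-dimensional projection merely stands in for the appearance bottleneck, whose actual design the text above does not specify.

```python
import numpy as np


def make_training_example(clip):
    """Turn one unlabeled clip of shape (T, H, W, C) into a self-supervised example."""
    return {
        "context": clip[0],     # first frame plays the role of the context image
        "demonstration": clip,  # assumption: the clip itself supplies the action
        "target": clip[1:],     # future frames the generator must predict
    }


def extract_action_latents(demo, proj):
    """Hypothetical bottleneck: flatten each frame and project to a few dimensions."""
    feats = demo.reshape(demo.shape[0], -1)  # stand-in for foundation-model features
    return feats @ proj                      # shape (T, latent_dim)


rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16, 3))             # toy clip: 8 frames of 16x16 RGB
proj = rng.standard_normal((16 * 16 * 3, 4))  # toy projection to 4 latent dims

example = make_training_example(clip)
latents = extract_action_latents(example["demonstration"], proj)
print(example["context"].shape, example["target"].shape, latents.shape)
```

In this sketch a generator would be trained to predict `target` from `context` and `latents`. At inference time the demonstration would come from a different scene than the context image, which is why the latents should carry as little appearance information as possible.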