Abstract: We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls, which are based on explicit signals, we adopt a form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential for interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.

What problem does this paper attempt to address?

The core problem this paper addresses is: **given a demonstration video and a context image from a different scene, how to generate a physically plausible video that continues naturally from the context image and carries out the action concepts shown in the demonstration**. The paper calls this video creation experience **Video Creation by Demonstration**. A user supplies a demonstration video showing the desired action concept and a context image of the initial scene, and the generated video applies those action concepts to the new scene while staying temporally and physically coherent.

To achieve this, the paper introduces $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos through conditional future frame prediction. Unlike most existing video generation controls, which rely on explicit signals, $\delta$-Diffusion uses implicit latent control to provide the flexibility and expressiveness that general videos require. An appearance bottleneck placed on top of a video foundation model extracts action latents from the demonstration video, so that the generation process is conditioned on the action with minimal appearance leakage.

### The main contributions of the paper include:

1. **Introducing Video Creation by Demonstration**: a new controllable video generation experience in which a video is used directly as the driving control signal to convey action concepts.
2. **Utilizing off-the-shelf video foundation models for the first time** for latent control of video generation.
3. **Proposing a new self-supervised training paradigm** that outperforms related baselines in both human preference and large-scale machine evaluations.

With these designs, the paper addresses the difficulty existing methods have in transferring action concepts and generating videos in complex scenes, and it points toward interactive world simulation.
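The self-supervised setup described above can be illustrated with a toy sketch. It is not the paper's implementation: `make_training_example` and `toy_action_latent` are hypothetical names, frames are tiny grayscale grids, and the "latent" is just the step-to-step change in mean intensity, chosen only because it discards absolute appearance. The sketch also assumes that, during training, the demonstration is the same clip whose future frames are being predicted.

```python
from typing import Dict, List

# A tiny grayscale frame: rows of pixel intensities (illustrative only).
Frame = List[List[float]]


def make_training_example(video: List[Frame], context_index: int, horizon: int) -> Dict[str, object]:
    """Split one unlabeled clip into a context frame and the future frames to predict."""
    if context_index < 0 or context_index + horizon >= len(video):
        raise ValueError("context_index + horizon must stay inside the clip")
    return {
        "context": video[context_index],
        "future": video[context_index + 1 : context_index + 1 + horizon],
    }


def frame_mean(frame: Frame) -> float:
    """Average intensity of one frame."""
    pixels = [p for row in frame for p in row]
    return sum(pixels) / len(pixels)


def toy_action_latent(clip: List[Frame]) -> List[float]:
    """Hypothetical stand-in for a bottlenecked action latent.

    Keeps only the change in mean intensity between consecutive frames,
    so a constant appearance offset cancels out.
    """
    means = [frame_mean(f) for f in clip]
    return [b - a for a, b in zip(means, means[1:])]


# Two clips with the same "motion" but different overall brightness.
video = [[[v, v], [v, v]] for v in (0.0, 1.0, 3.0, 6.0)]
brighter = [[[v + 10.0, v + 10.0], [v + 10.0, v + 10.0]] for v in (0.0, 1.0, 3.0, 6.0)]

example = make_training_example(video, context_index=0, horizon=3)
latent_a = toy_action_latent(video)
latent_b = toy_action_latent(brighter)
```

Here `latent_a` and `latent_b` are identical even though the two clips differ in appearance, which is the property an appearance bottleneck is meant to encourage. A real system would replace the toy latent with features from a video foundation model and train a generative model to predict `example["future"]` from `example["context"]` and the latent.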

Video Creation by Demonstration

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Latent Video Diffusion Models for High-Fidelity Long Video Generation

ControlVideo: Training-free Controllable Text-to-Video Generation

Imagen Video: High Definition Video Generation with Diffusion Models

Video Diffusion Models

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

LaMD: Latent Motion Diffusion for Video Generation

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Controllable Longer Image Animation with Diffusion Models

Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation

Animate Your Motion: Turning Still Images into Dynamic Videos

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control