Abstract: Incorporating a temporal dimension into pretrained image diffusion models is a prevalent approach to video generation. However, this approach is computationally demanding and requires large-scale video datasets. More critically, the heterogeneity between image and video datasets often results in catastrophic forgetting of the image expertise. Recent attempts to extract video snippets directly from image diffusion models have somewhat mitigated these problems. Nevertheless, these methods can only generate brief video clips with simple movements and fail to capture fine-grained motion or non-rigid deformation. In this paper, we propose a novel Zero-Shot video Sampling algorithm, denoted as $\mathcal{ZS}^2$, capable of directly sampling high-quality video clips from existing image synthesis methods, such as Stable Diffusion, without any training or optimization. Specifically, $\mathcal{ZS}^2$ utilizes the dependency noise model and temporal momentum attention to ensure content consistency and animation coherence, respectively. This ability enables it to excel in related tasks, such as conditional and context-specialized video generation and instruction-guided video editing. Experimental results demonstrate that $\mathcal{ZS}^2$ achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods. Homepage: https://densechen.github.io/zss/

What problem does this paper attempt to address?

The paper primarily aims to address the following issues:

1. **Zero-shot video generation**: A new zero-shot video sampling algorithm ($\mathcal{ZS}^2$) is proposed that generates high-quality video clips directly from pretrained image diffusion models without any training or optimization. This removes the need for large-scale video datasets and alleviates the catastrophic forgetting of image expertise that occurs when image models are adapted to video.

2. **Improving video content consistency and coherence**: A dependency noise model and a temporal momentum attention mechanism are introduced to ensure the content consistency of the generated video and the coherence of the animation. The dependency noise model keeps object appearance consistent across frames, while temporal momentum attention maintains the coherence of motion and the identity of foreground objects.

3. **Flexibility and control**: $\mathcal{ZS}^2$ can generate not only simple video clips but also complex motion, such as non-rigid deformations and smoke diffusion effects, and the speed and diversity of content changes can be controlled by adjusting parameters.

4. **Generality and applicability**: The method applies not only to text-to-video synthesis but also to conditional video generation, context-specialized video generation, and instruction-guided video editing.

5. **Performance evaluation and comparison**: The paper reports state-of-the-art performance for $\mathcal{ZS}^2$ in zero-shot video generation and finds that it occasionally outperforms recent supervised methods. Comparisons with other video diffusion models also show an advantage in image quality.

In summary, this research aims to simplify video generation and reduce its cost by proposing a novel zero-shot video sampling algorithm, while maintaining or improving the quality of the generated videos.
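
The dependency noise idea in point 2 can be illustrated with a small sketch. The text above does not give the paper's exact formulation, so the code below is only an assumption: it mixes one shared Gaussian sample with per-frame independent samples so that every frame's noise stays unit-variance while frames remain correlated. The names `sample_dependent_noise`, `shared_weight`, and `correlation` are hypothetical and do not come from the paper.

```python
import math
import random


def sample_dependent_noise(num_frames, dim, shared_weight, seed=0):
    """Hypothetical sketch of correlated initial noise for a stack of frames.

    Each frame's noise is sqrt(w) * shared + sqrt(1 - w) * independent,
    so every entry keeps unit variance and any two frames have
    correlation w. This is an illustrative assumption, not the paper's
    exact dependency noise model.
    """
    if not 0.0 <= shared_weight <= 1.0:
        raise ValueError("shared_weight must lie in [0, 1]")
    rng = random.Random(seed)
    # One noise vector shared by all frames (drives appearance consistency).
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    a = math.sqrt(shared_weight)
    b = math.sqrt(1.0 - shared_weight)
    frames = []
    for _ in range(num_frames):
        # Fresh independent noise per frame (leaves room for motion).
        frames.append([a * s + b * rng.gauss(0.0, 1.0) for s in shared])
    return frames


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    var_x = sum((p - mean_x) ** 2 for p in x)
    var_y = sum((q - mean_y) ** 2 for q in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    frames = sample_dependent_noise(num_frames=2, dim=20000, shared_weight=0.8)
    print(round(correlation(frames[0], frames[1]), 2))
```

In this sketch, `shared_weight` plays the role of a consistency knob: a value of 1 gives every frame identical noise, while 0 gives fully independent frames. How $\mathcal{ZS}^2$ actually couples the noise across frames may differ.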

Fine-gained Zero-shot Video Sampling

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

Efficient and consistent zero-shot video generation with diffusion models

StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models

Visual Data Synthesis Via GAN for Zero-Shot Video Classification

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Fine-Grained Feature Generation for Generalized Zero-Shot Video Classification

Zero-Shot Video Editing through Adaptive Sliding Score Distillation

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

Zero-Shot Learning Using Synthesised Unseen Visual Data with Diffusion Regularisation

IV-Mixed Sampler: Leveraging Image Diffusion Models for Enhanced Video Synthesis

Exploring Data Efficiency in Zero-Shot Learning with Diffusion Models

Zero-shot Video Restoration and Enhancement Using Pre-Trained Image Diffusion Model

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing

OSV: One Step is Enough for High-Quality Image to Video Generation