Abstract: We present Stable Video Diffusion, a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process to train a strong base model, including captioning and filtering strategies. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation and adaptability to camera motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D-prior and can serve as a base to finetune a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.

Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

TempDiff: Enhancing Temporal-awareness in Latent Diffusion for Real-World Video Super-Resolution

Exploiting Diffusion Prior for Real-World Image Super-Resolution

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models

ED-T2V: an Efficient Training Framework for Diffusion-based Text-to-Video Generation

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Photorealistic Video Generation with Diffusion Models

Towards Interpretable Video Super-Resolution Via Alternating Optimization

Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

Video Diffusion Models

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Image Super-resolution Via Latent Diffusion: A Sampling-space Mixture Of Experts And Frequency-augmented Decoder Approach

AddSR: Accelerating Diffusion-based Blind Super-Resolution with Adversarial Diffusion Distillation

Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation