Abstract: Following the advancements in text-guided image generation technology exemplified by Stable Diffusion, video generation is gaining increased attention in the academic community. However, relying solely on text guidance for video generation has serious limitations, as videos contain much richer content than images, especially in terms of motion, and this information can hardly be described adequately with plain text. Fortunately, in computer vision, various visual representations can serve as additional control signals to guide generation. With the help of these signals, video generation can be controlled in finer detail, allowing greater flexibility across applications. Integrating various controls, however, is nontrivial. In this paper, we propose a universal framework called EasyControl. By propagating and injecting condition features through condition adapters, our method enables users to control video generation with a single condition map. With our framework, various conditions, including raw pixels, depth, HED, etc., can be integrated into different UNet-based pre-trained video diffusion models at a low practical cost. We conduct comprehensive experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. EasyControl significantly improves various evaluation metrics across multiple validation datasets compared to previous works. Specifically, for the sketch-to-video generation task, EasyControl improves FVD by 152.0 and IS by 19.9 on UCF101 compared with VideoComposer. For fidelity, our model demonstrates powerful image retention ability, resulting in strong FVD and IS scores on UCF101 and MSR-VTT compared to other image-to-video models.

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

ControlVideo: Training-free Controllable Text-to-Video Generation

VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet

EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models

SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models

HARIVO: Harnessing Text-to-Image Models for Video Generation

Training-free Camera Control for Video Generation

ControlNeXt: Powerful and Efficient Control for Image and Video Generation

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

TrailBlazer: Trajectory Control for Diffusion-Based Video Generation

GVDIFF: Grounded Text-to-Video Generation with Diffusion Models

I4VGen: Image as Free Stepping Stone for Text-to-Video Generation

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Pix2Video: Video Editing using Image Diffusion