Abstract: We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO

### What problem does this paper attempt to solve?

The paper "HARIVO: Harnessing Text-to-Image Models for Video Generation" addresses how to use pretrained Text-to-Image (T2I) models to generate high-quality videos. Specifically, the authors aim to improve on existing methods in the following ways:

1. **Simplify the training process**: Existing methods usually fine-tune the entire model, which requires large amounts of data and compute and can also collapse the generated videos into a single style. This paper proposes an architecture that freezes the T2I model parameters and trains only the temporal layers, which simplifies training and reduces the dependence on large-scale datasets.
2. **Maintain diversity and consistency**: Many existing methods struggle to keep frames coherent with one another, so the resulting videos look unnatural. This paper introduces new loss functions (such as a temporal-regularized self-attention loss and a decoupled contrastive loss) and a gradient sampling technique so that the generated videos are temporally consistent while retaining the diversity of the original T2I model.
3. **Seamlessly integrate existing models**: The authors show that their method can be combined with existing personalized T2I models and adapters (such as DreamBooth, LoRA, ControlNet, and IP-Adapter) without additional training. This lets users generate personalized video content for different needs.
4. **Train with public datasets**: Most existing video-generation models rely on private or internal datasets for training, which limits their wider adoption. The method in this paper can be trained on public datasets (such as WebVid-10M) and still generate high-quality, temporally consistent videos.

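
The freeze-and-train idea in point 1 can be illustrated with a toy sketch. This is not the authors' implementation: the `Param` class, the parameter names (`spatial.*` for the frozen T2I weights, `temporal.*` for the added temporal layers), and the plain SGD update are all assumptions made for illustration, and a real model would rely on a deep learning framework. The sketch only shows that frozen weights stay untouched while temporal weights receive updates.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A single scalar weight with a flag saying whether it may be updated."""
    value: float
    trainable: bool


def build_params():
    # Hypothetical parameter names: "spatial.*" stands in for the frozen
    # pretrained T2I weights, "temporal.*" for newly added temporal layers.
    return {
        "spatial.conv": Param(0.5, trainable=False),
        "spatial.attn": Param(-1.0, trainable=False),
        "temporal.attn": Param(0.0, trainable=True),
    }


def sgd_step(params, grads, lr=0.1):
    """Apply one plain SGD update, skipping every frozen parameter."""
    for name, param in params.items():
        if param.trainable:
            param.value -= lr * grads[name]


params = build_params()
grads = {name: 1.0 for name in params}  # dummy gradients for the demo
sgd_step(params, grads)
```

After the step, only `temporal.attn` has changed. Because the pretrained weights are never modified, a setup of this kind leaves the original T2I model intact, which is what makes combining it with other T2I add-ons plausible.
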
### Main contributions

- **Novel architecture design**: A mapping network and a frame-wise token generator help the model handle inter-frame relationships in videos.
- **Single-stage training**: Training uses only public datasets and avoids the complexity of multi-stage training.
- **Temporal consistency**: Although the model is trained on public datasets, the generated videos have good temporal consistency.
- **Easy integration**: The method can be combined with existing personalized T2I models, which broadens the application scenarios for video generation.

Through these innovations, the paper provides a more efficient and flexible approach to high-quality text-to-video generation and offers new ideas for future research and applications.
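
To make the notion of temporal consistency concrete, here is a minimal sketch of a generic frame-difference penalty. It is an assumption chosen for illustration and is not the paper's temporal-regularized self-attention loss or decoupled contrastive loss, whose exact forms are not given in this summary. The function name `temporal_smoothness` and the representation of a frame as a flat list of floats are likewise hypothetical.

```python
def temporal_smoothness(frames):
    """Mean squared difference between consecutive frames.

    `frames` is a list of frames; each frame is a list of floats of the
    same length. Lower values mean smoother change from frame to frame.
    """
    if len(frames) < 2:
        return 0.0  # a single frame has no temporal variation

    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("all frames must have the same length")
        for a, b in zip(prev, curr):
            total += (a - b) ** 2
            count += 1
    return total / count


static_clip = [[0.2, 0.4], [0.2, 0.4], [0.2, 0.4]]
jumpy_clip = [[0.0], [2.0], [2.0]]
static_score = temporal_smoothness(static_clip)
jumpy_score = temporal_smoothness(jumpy_clip)
```

A perfectly static clip scores zero, while abrupt changes between frames raise the score. A penalty like this on its own would favor frozen video, which is one reason practical methods pair smoothness terms with other objectives.
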
