HARIVO: Harnessing Text-to-Image Models for Video Generation

Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh
2024-10-10
Abstract: We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problem does this paper attempt to solve?

The paper "HARIVO: Harnessing Text-to-Image Models for Video Generation" addresses how to use pretrained Text-to-Image (T2I) models to generate high-quality videos. Specifically, the authors attempt to improve on existing methods in the following ways:

1. **Simplify the training process**: Existing methods usually fine-tune the entire model, which requires large amounts of data and compute and can also collapse the generated videos into a single style. This paper proposes an architecture that freezes the T2I model parameters and trains only the temporal layers, which simplifies training and reduces the dependence on large-scale datasets.
2. **Maintain diversity and consistency**: Many existing methods struggle to keep frames coherent with one another, so the resulting videos look unnatural. This paper introduces new loss functions (such as a temporal-regularized self-attention loss and a decoupled contrastive loss) and a gradient sampling technique so that generated videos are temporally consistent while retaining the diversity of the original T2I model.
3. **Seamlessly integrate existing models**: The authors show that their method can be combined with existing personalized T2I models and add-ons (such as DreamBooth, LoRA, ControlNet, and IP-Adapter) without additional training. This lets users generate personalized video content for different needs.
4. **Train with public datasets**: Most existing video generation models rely on private or internal datasets for training, which limits their wider adoption. The method in this paper can be trained on public datasets (such as WebVid-10M) and still generate high-quality, temporally consistent videos.
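The frozen-backbone idea in point 1 can be illustrated with a toy sketch. This is not the authors' code: the `Param` class, the parameter names, and the rule of selecting trainable weights by the substring `"temporal"` are all assumptions made for illustration. It only shows the general pattern of updating temporal layers while leaving pretrained T2I weights untouched.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Toy stand-in for a model weight (hypothetical, for illustration only)."""
    name: str
    value: float
    trainable: bool = True


def freeze_all_but_temporal(params, temporal_marker="temporal"):
    """Mark only parameters whose name contains the marker as trainable."""
    for p in params:
        p.trainable = temporal_marker in p.name
    return params


def sgd_step(params, grads, lr=0.1):
    """Apply one gradient step, skipping every frozen parameter."""
    for p in params:
        if p.trainable:
            p.value -= lr * grads[p.name]


# Hypothetical parameter names: two "pretrained" spatial weights, one temporal weight.
params = [
    Param("unet.spatial.attn.weight", 1.0),
    Param("unet.spatial.conv.weight", 2.0),
    Param("unet.temporal.attn.weight", 3.0),
]
freeze_all_but_temporal(params)

# Every parameter receives a gradient, but only the temporal one is updated.
grads = {p.name: 1.0 for p in params}
sgd_step(params, grads, lr=0.1)
values = {p.name: p.value for p in params}
```

In a real deep learning framework the same effect is usually obtained by disabling gradients on the pretrained weights and passing only the new layers to the optimizer; the sketch above just makes the selection rule explicit.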
### Main contributions

- **Novel architecture design**: A mapping network and a frame-wise token generator help the model handle inter-frame relationships in videos.
- **Single-stage training**: Training uses only public datasets, avoiding the complexity of multi-stage training.
- **Temporal consistency**: Although the model is trained on public datasets, the generated videos have good temporal consistency.
- **Easy integration**: The method can be combined with existing personalized T2I models, expanding the application scenarios of video generation.

Through these innovations, the paper provides a more efficient and flexible approach to high-quality text-to-video generation and offers new directions for future research and applications.
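To make "temporal consistency" concrete, the sketch below computes a generic smoothness penalty: the mean squared difference between consecutive frame feature vectors. This is a common textbook-style measure and an assumption chosen for illustration. It is not the temporal-regularized self-attention loss or the decoupled contrastive loss described in the paper, and the function name and sample data are hypothetical.

```python
def temporal_smoothness_penalty(frames):
    """Mean squared difference between consecutive frame feature vectors.

    `frames` is a list of equal-length lists of floats, one per frame.
    A lower value means adjacent frames change less abruptly.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != len(curr):
            raise ValueError("all frames must have the same feature length")
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
        count += len(curr)
    return total / count


# Hypothetical two-feature "videos": one static, one flickering between states.
steady = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
flicker = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

steady_penalty = temporal_smoothness_penalty(steady)
flicker_penalty = temporal_smoothness_penalty(flicker)
```

A penalty of this kind rewards static output as much as smooth motion, which is one reason practical video models pair smoothness terms with other objectives rather than relying on them alone.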