VideoBooth: Diffusion-based Video Generation with Image Prompts

Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu
2023-12-02
Abstract: Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper mainly aims to address the following issues:

### Research Background and Objectives

- **Limitations of Video Generation**: Although existing text-driven video generation technologies have made rapid progress, relying solely on text prompts is insufficient to accurately depict the content users want, especially when customized content creation is needed.
- **Advantages of Image Prompts**: Compared to pure text prompts, providing image prompts can more accurately control the appearance details of the generated content.

### Issues Addressed

- **Research Task**: The paper investigates the task of video generation using image prompts, aiming to generate high-quality videos that match the appearance of objects specified in the image prompts.
- **Challenges**:
  - Accurately capturing the attributes in the image prompts and reflecting these attributes in the generated videos.
  - Achieving dynamic motion of objects while maintaining appearance consistency and natural smoothness.

### Main Contributions

1. **New Method**: Proposes a feed-forward framework named VideoBooth for generating videos using image prompts without requiring fine-tuning during the inference stage.
2. **Coarse-to-Fine Visual Embedding Strategy**: Introduces a new coarse-to-fine visual embedding strategy to better capture the characteristics of image prompts through an image encoder and attention injection.
3. **Attention Injection Method**: Proposes a novel attention injection method that utilizes the spatial information of multi-scale image prompts to refine the generated details.
4. **Dataset**: Constructs a dedicated VideoBooth dataset containing videos, image prompts, and text prompts to support the research of this task.

### Technical Implementation

- **Coarse Visual Embedding**: Extracts image features through a pre-trained image encoder and maps them to the text embedding space to embed the coarse appearance information of the image prompts.
- **Fine Visual Embedding**: Uses an attention injection module to take multi-scale image prompts as additional key-value pairs, refining the details of the first frame and maintaining temporal consistency between subsequent frames.
- **Training Strategy**: Adopts a coarse-to-fine training strategy, first training the coarse visual embedding and then the fine visual embedding to avoid information leakage and the learning of meaningless representations.

### Conclusion

- VideoBooth is a general framework capable of handling a wide range of image prompts within a single model and generating videos that match the appearance of objects specified in the image prompts while maintaining good temporal consistency.
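
The attention injection idea described above, in which image-prompt features are supplied to an attention layer as additional keys and values, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`attention`, `attention_with_injection`) are hypothetical, the code uses plain Python lists rather than a deep learning framework, and it assumes a single scale with no learned projections. The multi-scale features and the propagation from the first frame to later frames mentioned in the summary are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
        )
    return outputs


def attention_with_injection(frame_q, frame_k, frame_v, prompt_k, prompt_v):
    """Hypothetical helper: append image-prompt keys/values to the frame's own.

    Assumption: injection is modeled as simple concatenation along the
    token axis, so frame queries can attend to image-prompt tokens as well.
    """
    return attention(frame_q, frame_k + prompt_k, frame_v + prompt_v)


# Toy data: one frame token and one image-prompt token, both 2-dimensional.
frame_q = [[1.0, 0.0]]
frame_k = [[0.0, 1.0]]
frame_v = [[0.0, 0.0]]
prompt_k = [[10.0, 0.0]]  # strongly aligned with the frame query
prompt_v = [[1.0, 1.0]]   # stands in for appearance detail from the prompt

baseline = attention(frame_q, frame_k, frame_v)
injected = attention_with_injection(frame_q, frame_k, frame_v, prompt_k, prompt_v)
print("without injection:", baseline)
print("with injection:   ", injected)
```

In this toy setup the frame query matches the prompt key far better than its own key, so the injected output is pulled almost entirely toward the prompt value. That is the intuition behind using image-prompt features to refine details inside the attention layer.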