Abstract: Text-driven video generation has witnessed rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework, VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from an image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encodings of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at the fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and is then propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with a feed-forward pass.
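As a rough illustration of the coarse embedding in design 1), the sketch below projects an image feature vector into the text-embedding space with a linear map and places it in the prompt's token sequence. The linear projection, the choice to overwrite the subject word's slot, the dimensions, and all names (`linear`, `embed_prompt_with_image`, `subject_idx`) are assumptions made for illustration and are not taken from the paper; random vectors stand in for real encoder outputs.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def embed_prompt_with_image(token_embs, subject_idx, image_feat, weight, bias):
    """Return a copy of token_embs whose subject slot holds the projected image feature.

    Overwriting the subject word's slot is an assumption of this sketch.
    """
    coarse = linear(image_feat, weight, bias)
    if len(coarse) != len(token_embs[0]):
        raise ValueError("projected image feature must match the text embedding width")
    out = [list(t) for t in token_embs]  # copy so the input stays untouched
    out[subject_idx] = coarse
    return out


# Toy sizes and random stand-ins for the image and text encoder outputs.
rng = random.Random(0)
IMG_DIM, TXT_DIM, N_TOKENS = 6, 4, 5
weight = [[rng.gauss(0, 1) for _ in range(IMG_DIM)] for _ in range(TXT_DIM)]
bias = [0.0] * TXT_DIM
token_embs = [[rng.gauss(0, 1) for _ in range(TXT_DIM)] for _ in range(N_TOKENS)]
image_feat = [rng.gauss(0, 1) for _ in range(IMG_DIM)]

conditioned = embed_prompt_with_image(token_embs, 2, image_feat, weight, bias)
print(len(conditioned), len(conditioned[0]))  # 5 4
```

The conditioned sequence has the same shape as the original text embedding, so under these assumptions it could be passed to the same cross-attention layers that normally consume the text prompt.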

What problem does this paper attempt to address?

The paper mainly aims to address the following issues:

### Research Background and Objectives

- **Limitations of Video Generation**: Although existing text-driven video generation technologies have made rapid progress, relying solely on text prompts is insufficient to accurately depict the content users want, especially when customized content creation is needed.
- **Advantages of Image Prompts**: Compared to pure text prompts, providing image prompts can more accurately control the appearance details of the generated content.

### Issues Addressed

- **Research Task**: The paper investigates the task of video generation using image prompts, aiming to generate high-quality videos that match the appearance of objects specified in the image prompts.
- **Challenges**:
  - Accurately capturing the attributes in the image prompts and reflecting these attributes in the generated videos.
  - Achieving dynamic motion of objects while maintaining appearance consistency and natural smoothness.

### Main Contributions

1. **New Method**: Proposes a feedforward framework named VideoBooth for generating videos using image prompts without requiring fine-tuning during the inference stage.
2. **Coarse-to-Fine Visual Embedding Strategy**: Introduces a new coarse-to-fine visual embedding strategy to better capture the characteristics of image prompts through an image encoder and attention injection.
3. **Attention Injection Method**: Proposes a novel attention injection method that utilizes the spatial information of multi-scale image prompts to refine the generated details.
4. **Dataset**: Constructs a dedicated VideoBooth dataset containing videos, image prompts, and text prompts to support the research of this task.

### Technical Implementation

- **Coarse Visual Embedding**: Extracts image features through a pre-trained image encoder and maps them to the text embedding space to embed the coarse appearance information of the image prompts.
- **Fine Visual Embedding**: Uses an attention injection module to take multi-scale image prompts as additional key-value pairs, refining the details of the first frame and maintaining temporal consistency between subsequent frames.
- **Training Strategy**: Adopts a coarse-to-fine training strategy, first training the coarse visual embedding and then the fine visual embedding to avoid information leakage and the learning of meaningless representations.

### Conclusion

- VideoBooth is a general framework capable of handling a wide range of image prompts within a single model and generating videos that match the appearance of objects specified in the image prompts while maintaining good temporal consistency.
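The fine visual embedding described under Technical Implementation can be illustrated with a minimal attention sketch: image-prompt tokens are appended to the first frame's keys and values, and a later frame then attends to the refined first frame. This is a simplified toy under stated assumptions (single head, single scale, identity query/key/value projections, hand-picked 2-D tokens); the function and variable names are hypothetical, not the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


# Toy tokens (assumption: identity projections, so tokens act as Q, K and V).
first_frame = [[1.0, 0.0], [0.0, 1.0]]
image_prompt = [[5.0, 5.0]]   # stands in for one scale of image-prompt features
later_frame = [[1.0, 1.0]]

# Injection: the first frame attends to its own tokens plus the image-prompt tokens.
refined_first = attention(
    first_frame,
    first_frame + image_prompt,   # extra keys
    first_frame + image_prompt,   # extra values
)

# Propagation: a later frame uses the refined first frame as keys and values.
later_out = attention(later_frame, refined_first, refined_first)

print(refined_first)
print(later_out)
```

Because the image-prompt tokens only extend the key and value lists, the query side and the output shape of the layer are unchanged, which is why this kind of injection can be added to an existing attention layer without altering its interface.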

VideoBooth: Diffusion-based Video Generation with Image Prompts

HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Motion Prompting: Controlling Video Generation with Motion Trajectories

MotionBooth: Motion-Aware Customized Text-to-Video Generation

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models

Object-level Visual Prompts for Compositional Image Generation

Dynamic Prompt Optimizing for Text-to-Image Generation

DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions

Optical-Flow Guided Prompt Optimization for Coherent Video Generation

User-Friendly Customized Generation with Multi-Modal Prompts

AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

Optimizing Prompts for Text-to-Image Generation

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation