Abstract: Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation struggle to handle multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them in a single image. Further, on top of a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from the given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 different categories and 68 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.

What problem does this paper attempt to address?

This paper addresses multi-subject-guided text-to-video generation: how to make several subjects appear together in the generated video while keeping the identity of each subject distinct. Existing methods struggle with multiple subjects. They cannot guarantee that different subjects appear simultaneously, and they have trouble distinguishing highly similar subjects. The paper proposes the CustomVideo framework, which targets these problems with a simple and effective combination of co-occurrence and attention control.

### Main Contributions

1. **Proposing the CustomVideo Framework**: A new multi-subject-driven text-to-video generation framework. Through co-occurrence and attention control, it generates high-quality videos that preserve the identity of each subject.
2. **Constructing a Multi-subject T2V Dataset**: A dataset containing 63 individual subjects and 68 meaningful subject pairs, collected as a comprehensive benchmark.
3. **Superior Performance**: Extensive experiments show that CustomVideo outperforms existing state-of-the-art methods in qualitative evaluation, quantitative evaluation, and user preference studies.

### Method Overview

1. **Co-occurrence Control**: During training, multiple subject images are composed into a single image so that the model learns to generate the subjects together.
2. **Attention Control**: Cross-attention maps and object masks are used so that the model can tell the subjects apart, which improves the quality of the generated video.
   - **Positive Attention Mechanism**: The loss function \( L_{\text{attn}} \) forces the model to attend to the correct subject area.
   - **Negative Attention Mechanism**: Small negative values are introduced in the areas outside the object masks, which reduces the influence of irrelevant areas on the generated video.

### Experimental Results

1. **Qualitative Results**: As shown in Figure 3, videos generated by CustomVideo have noticeably higher subject fidelity than those of other methods, and the characteristics of different subjects are distinguished and preserved.
2. **Quantitative Results**: As shown in Table 1, CustomVideo outperforms existing state-of-the-art methods on all four evaluation metrics, with improvements of 11.99% in CLIP Image Alignment and 23.39% in DINO Image Alignment.
3. **User Preference Study**: As shown in Figure 4, CustomVideo obtains the highest user preference scores for text alignment, image alignment, and overall quality.

### Ablation Experiments

1. **Component Analysis**: The effectiveness of each component is verified by experiments that remove the background, skip the composition of subjects into a single image, or disable the positive or negative attention mechanism.
2. **Attention Control Mechanism**: The positive attention mechanism significantly improves the CLIP Image Alignment metric. The negative attention mechanism promotes better image alignment and significantly improves the temporal consistency of the generated video.

In conclusion, CustomVideo addresses the key challenges of multi-subject-guided text-to-video generation through its co-occurrence and attention control mechanisms, providing a new approach to generating high-quality personalized videos.
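
The co-occurrence and attention control steps from the Method Overview can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. The summary only says that subjects are composed into one image, that a loss \( L_{\text{attn}} \) aligns attention with object masks, and that small negative values are used outside the masks. The function names `compose_subjects` and `attention_loss`, the side-by-side concatenation layout, the mean-squared-error form of the loss, and the value `neg_value=-1e-3` are all hypothetical choices made here for illustration.

```python
import numpy as np


def compose_subjects(images, masks):
    """Place subject images side by side and pad each mask to the full canvas.

    Assumption: every image has shape (H, W_i, C) with the same H and C,
    and every mask has shape (H, W_i) with values in {0, 1}.
    """
    composed = np.concatenate(images, axis=1)  # shape (H, sum of W_i, C)
    total_width = composed.shape[1]

    padded_masks = []
    offset = 0
    for mask in masks:
        # Each mask is nonzero only inside its own subject's slice of the canvas.
        canvas = np.zeros((mask.shape[0], total_width), dtype=np.float32)
        canvas[:, offset:offset + mask.shape[1]] = mask
        padded_masks.append(canvas)
        offset += mask.shape[1]
    return composed, padded_masks


def attention_loss(attn_maps, masks, neg_value=-1e-3):
    """Hypothetical stand-in for L_attn: MSE between attention maps and targets.

    The target is 1 inside a subject's mask (positive attention) and a small
    negative value everywhere else (negative attention).
    """
    total = 0.0
    for attn, mask in zip(attn_maps, masks):
        target = np.where(mask > 0.5, 1.0, neg_value)
        total += np.mean((attn - target) ** 2)
    return total / len(attn_maps)
```

In this sketch, an attention map that matches its own subject's target exactly gives a loss of zero, while an attention map that fires on the other subject's region is penalized. That is the intuition behind using the masks to keep subject identities from mixing.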

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

MotionBooth: Motion-Aware Customized Text-to-Video Generation

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

Magic-Me: Identity-Specific Video Customized Diffusion

ControlVideo: Training-free Controllable Text-to-Video Generation

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

VideoBooth: Diffusion-based Video Generation with Image Prompts

NewMove: Customizing text-to-video models with novel motions