CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Zhao Wang, Aoxue Li, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li
2024-05-22
Abstract: Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation struggle to handle multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them in a single image. Further, on top of a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from the given reference images and provide a corresponding object mask for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 different categories and 68 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how, in multi-subject-guided text-to-video generation, to make multiple subjects appear together in the generated video while keeping each subject's identity distinct. Existing methods struggle with multiple subjects: they cannot guarantee that different subjects appear simultaneously, and they have trouble distinguishing highly similar subjects. The paper therefore proposes the CustomVideo framework, which targets these problems with a simple and effective co-occurrence and attention control mechanism.

### Main Contributions

1. **Proposing the CustomVideo Framework**: A new multi-subject-driven text-to-video generation framework. Through co-occurrence and attention control, it generates high-quality videos that preserve the identity of each subject.
2. **Constructing a Multi-subject T2V Dataset**: A dataset of 63 individual subjects and 68 meaningful subject combinations, collected as a comprehensive benchmark.
3. **Superior Performance**: Extensive experiments show that CustomVideo outperforms existing state-of-the-art methods in qualitative evaluation, quantitative evaluation, and user preference studies.

### Method Overview

1. **Co-occurrence Control**: During training, multiple subject images are spliced into a single image so that the model learns to render the subjects together.
2. **Attention Control**: Cross-attention maps and object masks are used so that the model can tell the subjects apart and produce higher-quality video.
   - **Positive Attention Mechanism**: The loss function \( L_{\text{attn}} \) forces the model to focus on the correct subject area.
   - **Negative Attention Mechanism**: Small negative values are introduced in the areas outside the object masks, reducing the influence of irrelevant areas on the generated video.

### Experimental Results

1. **Qualitative Results**: As shown in Figure 3, videos generated by CustomVideo are significantly better than those of other methods in subject fidelity, and they distinguish and preserve the characteristics of different subjects.
2. **Quantitative Results**: As shown in Table 1, CustomVideo outperforms existing state-of-the-art methods on all four evaluation metrics, with improvements of 11.99% in CLIP Image Alignment and 23.39% in DINO Image Alignment.
3. **User Preference Study**: As shown in Figure 4, CustomVideo obtains the highest user preference scores for text alignment, image alignment, and overall quality.

### Ablation Experiments

1. **Component Analysis**: Experiments such as removing the background, not splicing, and not using the positive or negative attention mechanisms verify the effectiveness of each component.
2. **Attention Control Mechanism**: The positive attention mechanism significantly improves the CLIP Image Alignment metric, while the negative attention mechanism promotes better image alignment and significantly enhances the temporal consistency of the generated video.

In conclusion, CustomVideo addresses the key challenges in multi-subject-guided text-to-video generation through its co-occurrence and attention control mechanisms, offering a new approach to generating high-quality personalized videos.
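
The co-occurrence and attention control steps summarized under Method Overview can be illustrated with a small sketch. This is not the authors' implementation: plain nested lists stand in for image tensors and cross-attention maps, and the function names (`splice_side_by_side`, `place_mask`, `attention_target`, `attention_loss`), the side-by-side layout, the mean-squared-error form of the loss, and the default negative value of `-0.01` are all assumptions made for illustration. The summary above states only that subject images are spliced into one image, that attention is guided toward the object mask, and that small negative values are used outside it.

```python
# Illustrative sketch only. Nested lists stand in for images and attention maps.
# Function names, the MSE loss form, and negative_value are assumptions.

def splice_side_by_side(image_a, image_b):
    """Concatenate two equal-height 2D grids horizontally into one image."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same height")
    return [row_a + row_b for row_a, row_b in zip(image_a, image_b)]


def place_mask(mask, total_width, x_offset):
    """Embed a subject mask into a zero canvas as wide as the spliced image."""
    placed = []
    for row in mask:
        canvas = [0] * total_width
        canvas[x_offset:x_offset + len(row)] = row
        placed.append(canvas)
    return placed


def attention_target(mask, negative_value):
    """Target map: 1.0 inside the subject mask, negative_value elsewhere."""
    return [[1.0 if cell else negative_value for cell in row] for row in mask]


def attention_loss(attn_map, mask, negative_value=-0.01):
    """Mean squared error between an attention map and its mask-derived target."""
    target = attention_target(mask, negative_value)
    squared_errors = [
        (a - t) ** 2
        for attn_row, target_row in zip(attn_map, target)
        for a, t in zip(attn_row, target_row)
    ]
    return sum(squared_errors) / len(squared_errors)


# Two hypothetical 2x2 subject masks, spliced into one 2x4 training image.
cat_mask = [[1, 1], [0, 1]]
dog_mask = [[1, 0], [1, 1]]
spliced_width = 4
cat_in_spliced = place_mask(cat_mask, spliced_width, x_offset=0)
dog_in_spliced = place_mask(dog_mask, spliced_width, x_offset=2)

# An attention map for the "cat" token that has drifted onto the dog region
# is penalized, while one matching the cat target is not.
drifted = attention_target(dog_in_spliced, -0.01)
aligned = attention_target(cat_in_spliced, -0.01)
loss_drifted = attention_loss(drifted, cat_in_spliced)
loss_aligned = attention_loss(aligned, cat_in_spliced)
```

In this toy setup, each subject token gets its own target built from that subject's mask in the spliced image, so attention that leaks onto the other subject raises the loss. A real implementation would operate on the diffusion model's cross-attention tensors at the resolution of each attention layer.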