TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang
2024-05-16
Abstract: Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., 'a red panda climbing a tree') and second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and are visually consistent (e.g., entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence using human evaluation. The project website is
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the problem of generating multi-scene videos from text descriptions. Most existing text-to-video (T2V) models can only generate single-scene video clips, that is, videos of an entity performing one specific action (for example, "A red panda climbs a tree"). In the real world, however, multi-scene videos are more common, such as "A red panda climbs a tree" followed by "The red panda sleeps on top of the tree". To generate videos that follow multi-scene text descriptions and remain visually consistent, the authors introduce the **Time-Aligned Captions (TALC) framework**.

#### Main challenges:

1. **Multi-scene video generation**: Existing T2V models are mainly trained on single-scene datasets and struggle to generate videos containing multiple consecutive scenes.
2. **Time alignment**: The generated video scenes must follow the temporal order of the text descriptions.
3. **Visual consistency**: Entities and backgrounds must remain consistent across scenes.

#### Solutions:

- **TALC framework**: By enhancing the text-conditioning mechanism in the T2V model, the model can recognize the temporal alignment between video scenes and scene descriptions. Concretely, the visual features of the earlier and later scenes of the video are conditioned on the representations of the first and second scene descriptions, respectively.
- **Fine-tuning pre-trained models**: The pre-trained T2V model is fine-tuned on multi-scene video-text data using the TALC framework to improve the quality of the generated videos.

#### Experimental results:

- The multi-scene videos generated with the TALC framework are significantly better than those of the baseline methods in terms of visual consistency and text adherence. Specifically, the TALC-finetuned model scores 15.5 points higher than the baseline methods in the overall score, which averages visual consistency and text adherence as judged in human evaluation.

With these improvements, the TALC framework maintains good visual consistency and text adherence when generating multi-scene videos, and so better reflects the scene changes found in real-world video.
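The time-aligned conditioning described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single cross-attention step with no learned query/key/value projections, an even split of frames across scenes, and hypothetical names (`frame_to_scene`, `time_aligned_cross_attention`). The only point it shows is that each frame attends to the caption of its own scene rather than to one merged prompt.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_to_scene(num_frames, num_scenes):
    """Map each frame index to a scene index.

    Assumption: frames are split as evenly as possible across scenes.
    """
    return np.arange(num_frames) * num_scenes // num_frames

def time_aligned_cross_attention(visual, captions, frame_scene):
    """Condition each frame only on the caption of its own scene.

    visual:      (F, P, D) array, F frames with P spatial positions each.
    captions:    list of (T_j, D) arrays, one text embedding per scene.
    frame_scene: (F,) integer array mapping frames to scenes.
    Learned projections are omitted to keep the sketch short.
    """
    out = np.empty_like(visual)
    d = visual.shape[-1]
    for f, scene in enumerate(frame_scene):
        text = captions[scene]                     # (T_j, D)
        scores = visual[f] @ text.T / np.sqrt(d)   # (P, T_j)
        out[f] = softmax(scores, axis=-1) @ text   # (P, D)
    return out

# Toy example: 8 frames, 2 scene captions with different token counts.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 4, 16))
captions = [rng.normal(size=(5, 16)), rng.normal(size=(7, 16))]
frame_scene = frame_to_scene(8, 2)
conditioned = time_aligned_cross_attention(visual, captions, frame_scene)
```

Here `frame_to_scene(8, 2)` assigns the first four frames to the first caption and the last four to the second, so editing the second caption changes only the later frames. Visual consistency across scenes would have to come from the rest of the video model (for example, its temporal layers), which this sketch does not cover.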