Abstract: Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from the pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., 'a red panda climbing a tree') and second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and are visually consistent (e.g., entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline methods by 15.5 points in the overall score, which averages visual consistency and text adherence using human evaluation. The project website is

### What problem does this paper attempt to solve?

This paper addresses the problem of generating multi-scene videos from text descriptions. Most existing text-to-video (T2V) models generate only single-scene video clips, that is, videos showing an entity performing one specific action (for example, "A red panda climbs a tree"). In the real world, however, multi-scene videos are more common, such as "A red panda climbs a tree" followed by "The red panda sleeps on top of the tree". To generate multi-scene videos that both follow the multi-scene text descriptions and remain visually consistent, the authors introduce the **Time-Aligned Captions (TALC) framework**.

#### Main challenges:

1. **Multi-scene video generation**: Existing T2V models are mainly trained on single-scene datasets, so they struggle to generate videos containing multiple consecutive scenes.
2. **Time alignment**: The generated video scenes must follow the temporal order of the text descriptions.
3. **Visual consistency**: The entities and backgrounds must remain consistent across scenes.

#### Solutions:

- **TALC framework**: The text-conditioning mechanism in the T2V model is enhanced so that the model recognizes the temporal alignment between video scenes and scene descriptions. Concretely, the visual features of the earlier and later video frames are conditioned on the representations of the first and second scene descriptions, respectively.
- **Fine-tuning pre-trained models**: The pre-trained T2V model is fine-tuned on multi-scene video-text data using the TALC framework to improve the quality of the generated videos.

#### Experimental results:

- The multi-scene videos generated with the TALC framework outperform the baseline methods in visual consistency and text adherence. Specifically, the TALC-fine-tuned model scores 15.5 points higher than the baseline methods on the overall score, which averages visual consistency and text adherence as judged by human evaluators.

Through these improvements, the TALC framework maintains good visual consistency and text adherence when generating multi-scene videos, and thereby better reflects the complex scene changes of the real world.
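The time-aligned conditioning described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the even split of frames across scenes, the single attention head, and the reuse of caption token embeddings as both keys and values are all assumptions made for illustration. A real T2V model would apply learned projections inside its cross-attention layers.

```python
import math


def frame_to_scene(num_frames, num_scenes):
    """Map each frame index to a scene index.

    Assumption: frames are split into contiguous, near-equal chunks,
    one chunk per scene description.
    """
    if num_scenes < 1 or num_frames < num_scenes:
        raise ValueError("need at least one frame per scene")
    return [f * num_scenes // num_frames for f in range(num_frames)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, tokens):
    """Single-head dot-product attention of one visual feature vector
    over the token embeddings of one caption (keys == values here)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def time_aligned_condition(frame_features, scene_token_embeddings):
    """Condition each frame only on the caption of the scene it belongs to,
    instead of on one merged caption for the whole video."""
    assignment = frame_to_scene(len(frame_features), len(scene_token_embeddings))
    return [
        cross_attend(feature, scene_token_embeddings[scene])
        for feature, scene in zip(frame_features, assignment)
    ]


# Toy usage: four frames, two scenes, one 2-d token per caption.
frames = [[1.0, 0.0]] * 4
scene_a = [[1.0, 0.0]]  # stands in for "a red panda climbing a tree"
scene_b = [[0.0, 1.0]]  # stands in for "the red panda sleeps on the top of the tree"
out = time_aligned_condition(frames, [scene_a, scene_b])
```

In this toy setup the first two frames attend only to the first caption and the last two only to the second, which is the alignment property the summary attributes to TALC. Visual consistency across scenes would come from the rest of the video model, which this sketch does not cover.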

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Text-to-Audio Generation Synchronized with Videos

To Create What You Tell: Generating Videos from Captions

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

ModelScope Text-to-Video Technical Report

T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs

Temporally Consistent Transformers for Video Generation

Text-Animator: Controllable Visual Text Video Generation

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Text-Conditioned Resampler For Long Form Video Understanding

Learning Video-Text Aligned Representations for Video Captioning