Abstract: Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that have misalignments with text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages: In (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering those questions with MLLM. In (2) refinement planning, we identify accurately generated objects and then create localized prompts to refine other areas in the video. Next, in (3) region decomposition, we segment the correctly generated area using a combined grounding module. We regenerate the video by adjusting the misaligned regions while preserving the correct regions in (4) localized refinement. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
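The four stages described in the abstract can be sketched as a toy control flow. This is only an illustration under stated assumptions, not the authors' implementation: every function name, the `RefinementPlan` type, and the callables `answer_question`, `segment`, and `regenerate` are hypothetical stand-ins for the MLLM, the grounding module, and the T2V model. The per-pixel compositing in stage 4 is also an assumption, since the abstract does not say how correct regions are preserved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Video = List[List[List[int]]]  # hypothetical toy video: frames x height x width
Mask = List[List[bool]]        # True where content is judged correct and is kept


@dataclass
class RefinementPlan:
    keep_objects: List[str]    # objects judged correctly generated
    local_prompts: List[str]   # placeholder prompts for the regions to redo


def evaluate_video(questions: List[str],
                   answer_question: Callable[[str], bool]) -> List[str]:
    """Stage 1: return the evaluation questions the video fails."""
    return [q for q in questions if not answer_question(q)]


def plan_refinement(objects_ok: Dict[str, bool]) -> RefinementPlan:
    """Stage 2: split objects into those to keep and those to regenerate."""
    keep = [name for name, ok in objects_ok.items() if ok]
    redo = [name for name, ok in objects_ok.items() if not ok]
    return RefinementPlan(keep_objects=keep, local_prompts=redo)


def localized_refinement(video: Video, regenerated: Video, keep_mask: Mask) -> Video:
    """Stage 4 (assumed compositing): keep masked pixels, replace the rest."""
    return [
        [
            [old if keep else new for old, new, keep in zip(o_row, n_row, m_row)]
            for o_row, n_row, m_row in zip(o_frame, n_frame, keep_mask)
        ]
        for o_frame, n_frame in zip(video, regenerated)
    ]


def repair(video: Video,
           questions: List[str],
           answer_question: Callable[[str], bool],
           objects_ok: Dict[str, bool],
           segment: Callable[[Video, List[str]], Mask],
           regenerate: Callable[[List[str]], Video]) -> Video:
    """Run the four stages in order; return the input if nothing fails."""
    if not evaluate_video(questions, answer_question):
        return video
    plan = plan_refinement(objects_ok)
    keep_mask = segment(video, plan.keep_objects)   # Stage 3: region decomposition
    regenerated = regenerate(plan.local_prompts)
    return localized_refinement(video, regenerated, keep_mask)
```

In this sketch the models are injected as callables, so the control flow can be exercised with simple stubs before any real evaluator, segmenter, or generator is plugged in.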


### What problem does this paper attempt to solve?

This paper addresses text-video misalignment in text-to-video (T2V) models, especially the alignment errors that arise in complex scenes. Specifically:

1. **Text-video alignment problem**: Existing T2V diffusion models often fail to follow text prompts accurately when generating videos, especially when the prompts contain multiple objects and attributes. For example, a model may generate the wrong number of objects or bind attributes to the wrong object.
2. **Limitations of existing methods**:
   - **Prompt optimization**: Improves alignment by iteratively searching for better prompts, but it lacks an explicit feedback mechanism and needs multiple iterations, which makes it inefficient.
   - **Local feedback methods**: Provide more explicit guidance, but require an externally trained layout-guided generation module, and the generated objects may be inconsistent with the original image.

To solve these problems, the paper proposes a framework named **VideoRepair**, which has the following features:

- **No additional training required**: VideoRepair is a training-free framework and can be applied directly to existing T2V diffusion models.
- **Fine-grained alignment detection**: It automatically identifies fine-grained text-video misalignments in the generated videos and provides spatial and textual feedback.
- **Localized repair**: Targeted local repair is carried out in four stages (video evaluation, refinement planning, region decomposition, and localized refinement) so that the repaired video matches the text prompt more closely.

With these improvements, VideoRepair substantially outperforms recent baselines on two popular video generation benchmarks (EvalCrafter and T2V-CompBench), especially on complex scenes.

### Summary

The main goal of this paper is to improve the alignment accuracy of text-to-video generation models, especially for multi-object and multi-attribute descriptions in complex scenes, by introducing an automatic repair framework called VideoRepair. The framework addresses the alignment errors and inefficiency of existing methods through fine-grained alignment detection and localized repair.

VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
