VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement

Daeun Lee, Jaehong Yoon, Jaemin Cho, Mohit Bansal
2024-11-23
Abstract: Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that are misaligned with their text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of four stages: In (1) video evaluation, we detect misalignments by generating fine-grained evaluation questions and answering those questions with an MLLM. In (2) refinement planning, we identify accurately generated objects and then create localized prompts to refine other areas in the video. Next, in (3) region decomposition, we segment the correctly generated area using a combined grounding module. In (4) localized refinement, we regenerate the video by adjusting the misaligned regions while preserving the correct regions. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair components and qualitative examples.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
### What problem does this paper attempt to solve?

This paper addresses text-video misalignment in text-to-video (T2V) diffusion models, particularly for prompts that describe complex scenes. Specifically:

1. **Text-video alignment problem**: Existing T2V diffusion models often fail to follow text prompts accurately, especially when the prompts contain multiple objects and attributes. For example, a model may generate the wrong number of objects or bind attributes to the wrong objects.
2. **Limitations of existing methods**:
   - **Prompt optimization**: These methods iteratively search for better prompts, but they lack an explicit feedback mechanism and need multiple iterations, which makes them inefficient.
   - **Localized feedback methods**: These provide more explicit guidance, but they require an externally trained layout-guided generation module, and the generated objects may be inconsistent with the original image.

To solve these problems, the paper proposes a framework named **VideoRepair**, which has the following features:

- **No additional training required**: VideoRepair is training-free and model-agnostic, so it can be applied directly to existing T2V diffusion models.
- **Fine-grained misalignment detection**: It automatically identifies fine-grained text-video misalignments in the generated video and produces explicit spatial and textual feedback.
- **Localized refinement**: Targeted refinement proceeds in four stages (video evaluation, refinement planning, region decomposition, and localized refinement), so that misaligned regions are regenerated while correctly generated regions are preserved.

With these components, VideoRepair substantially outperforms recent baselines on two popular video generation benchmarks (EvalCrafter and T2V-CompBench), especially when dealing with complex scenes.
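
To make the first two stages more concrete, here is a minimal Python sketch of question-based evaluation followed by refinement planning. It is an illustration under stated assumptions, not the authors' implementation: the names `QuestionResult`, `evaluate_video`, `plan_refinement`, and `answer_fn` are hypothetical, the MLLM is replaced by a plain callable that answers yes/no, and the questions are grouped per object by hand. Stages 3 and 4 (region decomposition and localized refinement) are omitted because they depend on external grounding and diffusion models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QuestionResult:
    """Outcome of one evaluation question (hypothetical structure)."""
    obj: str        # object the question refers to
    question: str   # fine-grained evaluation question
    passed: bool    # True if the video satisfies the question


def evaluate_video(questions: Dict[str, List[str]],
                   answer_fn: Callable[[str], bool]) -> List[QuestionResult]:
    """Stage 1 sketch: ask every question and record the answer.

    `answer_fn` stands in for the MLLM (an assumption); it maps a
    question string to True (aligned) or False (misaligned).
    """
    return [QuestionResult(obj, q, answer_fn(q))
            for obj, qs in questions.items()
            for q in qs]


def plan_refinement(results: List[QuestionResult]) -> Dict[str, List[str]]:
    """Stage 2 sketch: split objects into those to keep and those to redo.

    An object is kept only if all of its questions passed.
    """
    failed = {r.obj for r in results if not r.passed}
    objects = list(dict.fromkeys(r.obj for r in results))  # keeps first-seen order
    return {
        "keep": [o for o in objects if o not in failed],
        "regenerate": [o for o in objects if o in failed],
    }


# Toy example: the prompt asks for one brown dog and two cats.
example_questions = {
    "dog": ["Is there a dog?", "Is the dog brown?"],
    "cat": ["Are there two cats?"],
}
# Hard-coded answers replace a real MLLM call in this sketch.
example_answers = {
    "Is there a dog?": True,
    "Is the dog brown?": True,
    "Are there two cats?": False,
}
example_plan = plan_refinement(
    evaluate_video(example_questions, example_answers.__getitem__)
)
print(example_plan)  # {'keep': ['dog'], 'regenerate': ['cat']}
```

In this toy case the dog passes every question and would be preserved, while the cat count fails and would become the target of a localized prompt in the later stages.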
### Summary

The main goal of this paper is to improve the text-video alignment of text-to-video generation models, especially for multi-object and multi-attribute descriptions in complex scenes, by introducing an automatic refinement framework, VideoRepair. The framework addresses the alignment errors and inefficiency of existing methods through fine-grained misalignment detection and localized refinement.