Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the shortcomings of existing evaluation metrics in text-to-video generation. Current research relies primarily on automated metrics such as FVD, IS, and CLIP Score to assess the quality of generated videos. However, these metrics have limited ability to evaluate the temporal dimension and cannot fully reflect the true quality of videos. User studies reflect human perception more accurately, but they are time-consuming and prone to subjective bias. The paper therefore proposes a new evaluation framework, the Text-to-Video Score (T2VScore), to assess text-guided generated videos more accurately.

### Main Contributions

1. **Introduction of T2VScore**: A new automated evaluation metric focusing on two key aspects: text-video alignment and video quality.
2. **Collection of the TVGE Dataset**: The first publicly available dataset specifically for evaluating text-to-video generation, containing 2,543 human-annotated text-guided generated videos with both alignment and quality ratings.
3. **Validation of Existing Metrics Against Human Judgments**: Using the TVGE dataset, the paper shows that existing objective metrics are inconsistent with human judgments and that T2VScore agrees with them better on both criteria.

### Method Overview

- **Text-Video Alignment (T2VScore-A)**:
  - Uses state-of-the-art multimodal large language models (MLLMs) to generate questions and answers, assessing whether the video accurately reflects the text description.
  - Introduces an auxiliary trajectory to improve the understanding of object and camera movements, thereby improving accuracy on questions about temporal dynamics.
- **Video Quality (T2VScore-Q)**:
  - Combines a technical expert and a semantic expert to evaluate video quality.
  - The technical expert, based on the FAST-VQA model, captures spatial and temporal technical distortions.
  - The semantic expert, based on the MetaCLIP model, evaluates video quality through a binary classification task.
  - Employs a Mix-of-Limited-Expert Structure, a Progressive Optimization Strategy, and List-wise Learning Objectives to improve generalization.

### Experimental Results

- **Text-Video Alignment**:
  - T2VScore-A achieves the highest correlation with human judgments, especially when paired with strong multimodal LLMs such as GPT-4V.
  - The auxiliary trajectory significantly improves accuracy on questions about temporal dynamics.
- **Video Quality**:
  - T2VScore-Q performs well across multiple benchmarks and generalizes better to unseen generative models.
  - Compared with existing technical quality evaluation methods, T2VScore-Q assesses video quality more comprehensively.

### Conclusion

By proposing T2VScore and the accompanying TVGE dataset, the paper provides a more reliable and comprehensive framework for evaluating text-to-video generation. This helps researchers better understand the quality of generated videos and offers a resource for future research on evaluation metrics.
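The question-answering idea behind T2VScore-A, and the way a metric is checked against human ratings, can be illustrated with a small sketch. This is not the paper's implementation: it assumes the alignment score is simply the fraction of generated questions answered correctly, and it uses Spearman rank correlation as one common way to compare metric scores with human judgments (the summary does not say which coefficient the paper reports). `QAPair`, `alignment_score`, `spearman`, and the `answer_fn` callback standing in for a video question-answering model are all hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class QAPair:
    """One question derived from the text prompt (hypothetical structure)."""
    question: str
    choices: tuple[str, ...]
    answer: str  # expected answer according to the prompt


def alignment_score(
    qa_pairs: Sequence[QAPair],
    answer_fn: Callable[[str, tuple[str, ...]], str],
) -> float:
    """Fraction of questions the model answers as the prompt implies.

    `answer_fn` is a stand-in for a model that watches the video and
    picks one of the choices; here it only sees the question text.
    """
    if not qa_pairs:
        raise ValueError("need at least one question")
    correct = sum(answer_fn(qa.question, qa.choices) == qa.answer for qa in qa_pairs)
    return correct / len(qa_pairs)


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Toy data: a stub "model" that always picks the first choice.
    pairs = [
        QAPair("Is there a dog?", ("yes", "no"), "yes"),
        QAPair("Which way does the camera pan?", ("left", "right"), "right"),
    ]
    print(alignment_score(pairs, lambda q, choices: choices[0]))
    # Made-up metric scores and human ratings for four videos.
    print(spearman([0.2, 0.5, 0.9, 0.7], [1.0, 2.5, 4.5, 4.0]))
```

In practice `answer_fn` would wrap a multimodal model call on sampled video frames, and the correlation would be computed over all annotated videos in a dataset such as TVGE rather than a handful of toy values.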

Towards A Better Metric for Text-to-Video Generation
