FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou

2023-12-26

Abstract: Recently, open-domain text-to-video (T2V) generation models have made remarkable progress. However, the promising results are mainly shown through qualitative examples of generated videos, while the quantitative evaluation of T2V models still faces two critical problems. Firstly, existing studies lack fine-grained evaluation of T2V models on different categories of text prompts. Although some benchmarks have categorized the prompts, their categorization either focuses on only a single aspect or fails to consider the temporal information in video generation. Secondly, it is unclear whether the automatic evaluation metrics are consistent with human standards. To address these problems, we propose FETV, a benchmark for Fine-grained Evaluation of Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control and the prompt complexity. FETV is also temporal-aware, introducing several temporal categories tailored for video generation. Based on FETV, we conduct comprehensive manual evaluations of four representative T2V models, revealing their pros and cons on different categories of prompts from different aspects. We also extend FETV as a testbed to evaluate the reliability of automatic T2V metrics. The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios. We find that existing automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human evaluation. To address this problem, we explore several solutions to improve CLIPScore and FVD, and develop two automatic metrics that exhibit significantly higher correlation with humans than existing metrics. Benchmark page: https://github.com/llyx97/FETV

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two main issues:

1. **Lack of fine-grained evaluation**: Existing work on open-domain text-to-video (T2V) generation mainly demonstrates model quality through qualitative examples of generated videos and lacks fine-grained quantitative evaluation of model performance across different categories of text prompts. Although some benchmarks categorize their prompts, these categorizations either focus on a single aspect or ignore the temporal information in video generation.
2. **Reliability of automatic evaluation metrics**: It is unclear whether existing automatic evaluation metrics align with human evaluation standards. In text-to-image generation, existing automatic metrics have been found not to align with human judgments, but this issue has not been fully explored for T2V generation.

To address these issues, the paper proposes FETV, a benchmark for fine-grained evaluation of text-to-video generation models. FETV categorizes text prompts along three orthogonal aspects (major content, attributes to control, and prompt complexity) and introduces several temporal categories tailored to video generation. Based on FETV, the authors conduct a comprehensive manual evaluation of four representative T2V models, revealing their strengths and weaknesses on different categories of prompts. FETV is also used as a testbed for evaluating the reliability of automatic metrics, and the authors develop two new automatic metrics that correlate significantly more strongly with human evaluations than existing metrics.
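As a rough illustration of what measuring a metric's agreement with humans per prompt category can look like, the sketch below computes a Spearman rank correlation between an automatic metric's scores and human ratings, grouped by category. This is not the authors' code: the choice of Spearman correlation, the function names, the category labels, and all numbers are assumptions made for illustration only.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return float("nan")
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_by_category(records):
    """records: iterable of (category, metric_score, human_score) tuples."""
    grouped = {}
    for category, metric_score, human_score in records:
        xs, ys = grouped.setdefault(category, ([], []))
        xs.append(metric_score)
        ys.append(human_score)
    return {cat: spearman(xs, ys) for cat, (xs, ys) in grouped.items()}


# Hypothetical data: (prompt category, automatic metric score, human rating).
records = [
    ("people", 0.31, 4), ("people", 0.28, 3),
    ("people", 0.25, 2), ("people", 0.22, 1),
    ("animals", 0.30, 1), ("animals", 0.27, 2),
    ("animals", 0.24, 3), ("animals", 0.21, 4),
]
per_category = correlation_by_category(records)
```

In this made-up data the metric ranks videos the same way humans do for one category and in the opposite order for the other, so a single pooled correlation would hide the disagreement that a per-category breakdown exposes.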

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Towards A Better Metric for Text-to-Video Generation

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

EditBoard: Towards A Comprehensive Evaluation Benchmark for Text-based Video Editing Models

Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

VideoEval: Comprehensive Benchmark Suite for Low-Cost Evaluation of Video Foundation Model

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

VBench: Comprehensive Benchmark Suite for Video Generative Models

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

[Collinearity in multivariable analysis: causes, detection and control measures].

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

TAVGBench: Benchmarking Text to Audible-Video Generation

VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation