FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou
2023-12-26
Abstract: Recently, open-domain text-to-video (T2V) generation models have made remarkable progress. However, the promising results are mainly shown by qualitative cases of generated videos, while the quantitative evaluation of T2V models still faces two critical problems. Firstly, existing studies lack fine-grained evaluation of T2V models on different categories of text prompts. Although some benchmarks have categorized the prompts, their categorization either focuses on only a single aspect or fails to consider the temporal information in video generation. Secondly, it is unclear whether the automatic evaluation metrics are consistent with human standards. To address these problems, we propose FETV, a benchmark for Fine-grained Evaluation of Text-to-Video generation. FETV is multi-aspect, categorizing the prompts based on three orthogonal aspects: the major content, the attributes to control, and the prompt complexity. FETV is also temporal-aware, introducing several temporal categories tailored for video generation. Based on FETV, we conduct comprehensive manual evaluations of four representative T2V models, revealing their pros and cons on different categories of prompts from different aspects. We also extend FETV as a testbed to evaluate the reliability of automatic T2V metrics. The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios. We find that existing automatic metrics (e.g., CLIPScore and FVD) correlate poorly with human evaluation. To address this problem, we explore several solutions to improve CLIPScore and FVD, and develop two automatic metrics that exhibit significantly higher correlation with humans than existing metrics. Benchmark page: https://github.com/llyx97/FETV
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper attempts to address two main issues:

1. **Lack of fine-grained evaluation**: Existing work on open-domain text-to-video (T2V) generation models demonstrates quality mainly through qualitative examples of generated videos and lacks fine-grained quantitative evaluation of model performance on different categories of text prompts. Although some benchmarks have categorized text prompts, their categorizations either focus on a single aspect or ignore the temporal information in video generation.
2. **Reliability of automatic evaluation metrics**: It is unclear whether existing automatic evaluation metrics align with human evaluation standards. In text-to-image generation, existing automatic metrics have been found not to align with human judgments, but this issue has not been fully explored for T2V generation.

To address these issues, the paper proposes FETV, a benchmark for fine-grained evaluation of text-to-video generation models. FETV categorizes text prompts along three orthogonal aspects (major content, attributes to control, and prompt complexity) and introduces several temporal categories tailored for video generation. Based on FETV, the authors conduct a comprehensive manual evaluation of four representative T2V models, revealing their strengths and weaknesses on different categories of prompts. FETV is also used as a testbed to evaluate the reliability of automatic evaluation metrics, and the authors develop two new automatic metrics that correlate significantly better with human evaluation than existing metrics.
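The idea of checking metric reliability per prompt category can be illustrated with a small sketch. It assumes that each generated video has an automatic metric score, a human rating, and a category label, and it uses Spearman's rank correlation as one common choice of agreement measure; the text above does not state which correlation coefficient the paper uses. The record fields (`category`, `metric`, `human`), the category labels, and the numbers are hypothetical placeholders, not FETV data.

```python
from collections import defaultdict
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_by_category(records):
    """Group records by category and correlate metric scores with human ratings."""
    groups = defaultdict(lambda: ([], []))
    for record in records:
        metric_scores, human_ratings = groups[record["category"]]
        metric_scores.append(record["metric"])
        human_ratings.append(record["human"])
    return {cat: spearman(m, h) for cat, (m, h) in groups.items()}


# Hypothetical toy records: the metric agrees with humans in one category
# and disagrees in the other.
toy_records = [
    {"category": "category_a", "metric": 0.21, "human": 1},
    {"category": "category_a", "metric": 0.25, "human": 2},
    {"category": "category_a", "metric": 0.30, "human": 4},
    {"category": "category_b", "metric": 0.21, "human": 5},
    {"category": "category_b", "metric": 0.25, "human": 3},
    {"category": "category_b", "metric": 0.30, "human": 2},
]
per_category = correlation_by_category(toy_records)
```

A pooled correlation over all records would hide the disagreement in the second toy category, which is the kind of effect a per-category breakdown is meant to expose.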