Towards A Better Metric for Text-to-Video Generation

Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
2024-01-15
Abstract: Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, thus rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, with outcomes that are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore on offering a better metric for text-to-video generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the shortcomings of existing evaluation metrics for text-to-video generation. Current research relies primarily on automated metrics such as FVD, IS, and CLIP Score to assess generated videos. These metrics capture the temporal dimension poorly and therefore do not fully reflect true video quality. User studies reflect human perception more accurately, but they are time-consuming and prone to subjective bias. The paper therefore proposes a new evaluation framework, the Text-to-Video Score (T2VScore), to assess text-guided generated videos more accurately.

### Main Contributions

1. **Introduction of T2VScore**: A new automated evaluation metric that focuses on two key aspects: text-video alignment and video quality.
2. **Collection of the TVGE Dataset**: Described as the first publicly available dataset built specifically for evaluating text-to-video generation metrics. It contains human judgments of 2,543 text-guided generated videos, covering both alignment and quality.
3. **Validation of Existing Metrics Against Human Judgments**: Using the TVGE dataset, the paper shows that existing objective metrics are inconsistent with human judgments and that T2VScore agrees with them more closely on both criteria.

### Method Overview

- **Text-Video Alignment (T2VScore-A)**:
  - Uses state-of-the-art multimodal large language models (MLLMs) to generate questions and answers from the text prompt, then checks whether the video accurately reflects the text description.
  - Introduces auxiliary trajectory information to help the model understand object and camera movements, which improves accuracy on questions about temporal dynamics.
- **Video Quality (T2VScore-Q)**:
  - Combines a technical expert and a semantic expert to evaluate video quality.
  - The technical expert, based on the FAST-VQA model, captures spatial and temporal technical distortions.
  - The semantic expert, based on the MetaCLIP model, evaluates video quality through a binary classification task.
  - Employs a Mix-of-Limited-Expert Structure, a Progressive Optimization Strategy, and List-wise Learning Objectives to improve generalization.

### Experimental Results

- **Text-Video Alignment**:
  - T2VScore-A achieves the highest correlation with human judgments among the compared metrics, especially when paired with advanced MLLMs such as GPT-4V.
  - Adding the auxiliary trajectory significantly improves accuracy on questions about temporal dynamics.
- **Video Quality**:
  - T2VScore-Q performs well across multiple benchmarks and generalizes better to videos from unseen generative models.
  - Compared with existing technical quality evaluation methods, T2VScore-Q assesses video quality more comprehensively.

### Conclusion

By proposing T2VScore and the accompanying TVGE dataset, the paper provides a more reliable and comprehensive framework for evaluating text-to-video generation. This helps researchers better understand the quality of generated videos and offers a resource for future work on evaluation metrics.
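
The question-answering formulation of T2VScore-A can be illustrated with a minimal sketch. It assumes that the alignment score is the fraction of generated questions that a video-aware model answers as expected; the summary does not state the exact scoring rule. The names `QAPair`, `alignment_score`, and `toy_answerer` are hypothetical, and the canned answerer stands in for an MLLM that has actually watched the video.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class QAPair:
    """One question derived from the text prompt, with its expected answer."""
    question: str
    choices: Sequence[str]
    expected: str


def alignment_score(
    qa_pairs: Sequence[QAPair],
    answer_fn: Callable[[str, Sequence[str]], str],
) -> float:
    """Fraction of questions answered as expected (assumed scoring rule)."""
    if not qa_pairs:
        raise ValueError("need at least one question")
    correct = sum(
        answer_fn(qa.question, qa.choices).strip().lower()
        == qa.expected.strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)


def toy_answerer(question: str, choices: Sequence[str]) -> str:
    """Hypothetical stand-in for an MLLM answering questions about a video."""
    canned = {
        "Is there a dog in the video?": "yes",
        "Is the dog running?": "no",
    }
    # Unknown questions fall back to the first choice.
    return canned.get(question, choices[0])


# Questions that might be generated from the prompt "a dog running on a beach".
pairs = [
    QAPair("Is there a dog in the video?", ("yes", "no"), "yes"),
    QAPair("Is the dog running?", ("yes", "no"), "yes"),
    QAPair("Is the scene on a beach?", ("yes", "no"), "yes"),
]

score = alignment_score(pairs, toy_answerer)
print(f"alignment score: {score:.3f}")
```

Passing the answering model in as a callable keeps the scoring rule independent of any particular MLLM, so the same function can be reused when the underlying model is swapped.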
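
Agreement between an automated metric and human judgments is typically measured with a rank correlation. The sketch below computes Spearman's rank correlation in plain Python as one common choice; the summary does not say which coefficient the paper reports, and the scores used here are made-up illustration values, not TVGE data.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five videos (illustration only).
human_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
metric_scores = [0.81, 0.40, 0.35, 0.10, 0.92]

rho = spearman(human_scores, metric_scores)
print(f"Spearman rho: {rho:.2f}")  # prints "Spearman rho: 0.90"
```

A rank correlation only asks whether the metric orders videos the way annotators do, so it is unaffected by the metric and the human ratings living on different scales.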