Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu
2024-08-08
Abstract: With the rapid development of generative models, Artificial Intelligence-Generated Content (AIGC) has increased exponentially in daily life. Within AIGC, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating videos of high perceptual quality, there is still a lack of methods to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from the text-video alignment and video fidelity perspectives, then leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjectively aligned predictions, validating its effectiveness. The dataset and code will be released at https://github.com/QMME/T2VQA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses quality assessment for Text-to-Video (T2V) generation. Although many existing T2V models can generate videos with high perceptual quality, there is currently no established method for evaluating the quality of these videos quantitatively. Specifically, the paper points out the following:

1. **Deficiencies of Existing Evaluation Methods**:
   - **Traditional Video Quality Assessment (VQA) Models**: These models perform poorly on this task because the distortions introduced by T2V generation models (such as jitter effects and unreasonable objects) differ from those found in natural videos.
   - **Common T2V Evaluation Metrics**: Metrics such as IS (Inception Score), FVD (Fréchet Video Distance), and CLIPSim do not reflect real user preferences. IS cannot accurately assess image/video quality, FVD requires natural videos as a reference, and CLIPSim only considers text-video alignment from an image perspective, ignoring temporal information and perceptual video quality.

2. **Proposed New Methods**:
   - **Establishing a Large-Scale Dataset**: The paper establishes the largest subjective T2V dataset to date, named T2VQA-DB, which contains 10,000 videos generated by 9 different T2V models together with their corresponding Mean Opinion Scores (MOS).
   - **Proposing a New Evaluation Model**: Based on T2VQA-DB, the paper proposes a new transformer-based model called T2VQA for subjective-aligned Text-to-Video Quality Assessment. The model extracts features from both the text-video alignment and video fidelity perspectives and uses a large language model (LLM) to predict the score.

Through these contributions, the paper aims to provide a more comprehensive and accurate T2V quality assessment tool and thereby promote the development and application of T2V generation technology.
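
The point that CLIPSim ignores temporal information can be illustrated with a minimal sketch. It assumes CLIPSim is computed as the average cosine similarity between a text embedding and each frame embedding. The vectors below are random stand-ins for real CLIP features, and the function names are hypothetical, not taken from the paper's code. Under that assumption, reordering the frames leaves the score unchanged, so temporal defects such as jitter cannot affect it.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def frame_averaged_similarity(text_emb, frame_embs):
    """Average text-to-frame similarity (assumed CLIPSim-style score)."""
    sims = [cosine(text_emb, f) for f in frame_embs]
    return sum(sims) / len(sims)


# Random placeholders for embeddings; a real pipeline would obtain
# these from a CLIP text encoder and image encoder.
random.seed(0)
dim, n_frames = 16, 8
text_emb = [random.gauss(0, 1) for _ in range(dim)]
frame_embs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_frames)]

score = frame_averaged_similarity(text_emb, frame_embs)

# Playing the video backwards yields the same score up to rounding:
# the metric treats the video as an unordered bag of frames.
reversed_score = frame_averaged_similarity(text_emb, frame_embs[::-1])
print(math.isclose(score, reversed_score, abs_tol=1e-9))
```

Any metric built as an unordered average over per-frame scores has this property, which is one reason the summary's call for a metric that also accounts for video fidelity is plausible.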