Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Tengchuan Kou, Xiaohong Liu, Zicheng Zhang, Chunyi Li, Haoning Wu, Xiongkuo Min, Guangtao Zhai, Ning Liu

2024-08-08

Abstract: With the rapid development of generative models, Artificial Intelligence-Generated Content (AIGC) has increased exponentially in daily life. Within AIGC, Text-to-Video (T2V) generation has received widespread attention. Though many T2V models have been released for generating videos of high perceptual quality, there is still no method to evaluate the quality of these videos quantitatively. To solve this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset is composed of 10,000 videos generated by 9 different T2V models. We also conduct a subjective study to obtain each video's corresponding mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from the text-video alignment and video fidelity perspectives, and then leverages the ability of a large language model to give the prediction score. Experimental results show that T2VQA outperforms existing T2V metrics and SOTA video quality assessment models. Quantitative analysis indicates that T2VQA is capable of giving subjective-aligned predictions, validating its effectiveness. The dataset and code will be released at https://github.com/QMME/T2VQA.
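The abstract mentions that each video receives a mean opinion score (MOS) from a subjective study. As a minimal sketch of that idea, assuming the MOS is a plain arithmetic mean of raw ratings (the abstract does not say whether any per-rater normalization or outlier screening is applied), the computation could look like the following. The function name, video ids, and 1-5 rating scale are hypothetical and used only for illustration.

```python
from statistics import mean


def mean_opinion_scores(ratings):
    """Map each video id to the arithmetic mean of its raw ratings.

    Assumption: a plain mean with no rater normalization; videos with
    no ratings are skipped rather than given a default score.
    """
    return {
        video_id: mean(scores)
        for video_id, scores in ratings.items()
        if scores
    }


# Hypothetical ratings on a 1-5 scale, one list of scores per video.
ratings = {
    "video_a": [4, 5, 4, 3],
    "video_b": [2, 1, 2, 3],
}

mos = mean_opinion_scores(ratings)
print(mos)
```

A learned metric would then be trained or evaluated against these per-video scores as its ground truth.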

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses quality assessment for Text-to-Video (T2V) generation. Although many existing T2V models can generate videos with high perceptual quality, there is currently no established method for evaluating the quality of these videos quantitatively. Specifically, the paper points out the following:

1. **Deficiencies of Existing Evaluation Methods**:
   - **Traditional Video Quality Assessment (VQA) Models**: These models perform poorly on the task because the distortions introduced by T2V generation models (such as jitter effects and unreasonable objects) differ from those found in natural videos.
   - **Common T2V Evaluation Metrics**: Metrics such as IS (Inception Score), FVD (Fréchet Video Distance), and CLIPSim do not reflect real user preferences. For example, IS cannot accurately assess image/video quality, FVD requires natural videos as a reference, and CLIPSim only considers text-video alignment from an image perspective, ignoring temporal information and perceptual video quality.
2. **Proposed New Methods**:
   - **Establishing a Large-Scale Dataset**: The paper establishes the largest subjective T2V dataset to date, named T2VQA-DB, which contains 10,000 videos generated by 9 different T2V models and their corresponding Mean Opinion Scores (MOS).
   - **Proposing a New Evaluation Model**: Based on T2VQA-DB, the paper proposes a new transformer-based model called T2VQA for subjective-aligned text-to-video quality assessment. This model extracts features from both the text-video alignment and video fidelity perspectives and uses a large language model (LLM) to predict the score.

Through these methods, the paper aims to provide a more comprehensive and accurate T2V quality assessment tool to promote the development and application of T2V generation technology.
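The point that CLIPSim ignores temporal information can be illustrated with a small sketch. It assumes a CLIPSim-style score is the average cosine similarity between one text embedding and each frame embedding; the toy vectors and function names below are hypothetical stand-ins, not the metric's actual implementation or real encoder outputs. Under that assumption, reordering the frames leaves the score unchanged.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frame_averaged_similarity(text_embedding, frame_embeddings):
    """Average the per-frame cosine similarity to the text embedding."""
    sims = [cosine(text_embedding, frame) for frame in frame_embeddings]
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical); a real metric would obtain them
# from an image-text encoder applied to the prompt and each frame.
text = [1.0, 0.0, 0.5]
frames = [
    [0.9, 0.1, 0.4],
    [0.2, 0.8, 0.1],
    [0.7, 0.0, 0.6],
    [0.1, 0.1, 0.9],
]

score_in_order = frame_averaged_similarity(text, frames)
score_reversed = frame_averaged_similarity(text, list(reversed(frames)))

# The two scores agree up to floating-point rounding, so a video
# played backwards would be rated the same as the original.
print(score_in_order, score_reversed)
```

Because the average treats frames as an unordered set, such a score cannot penalize jitter or implausible motion, which is consistent with the limitation described above.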
