AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

Jiarui Wang, Huiyu Duan, Guangtao Zhai, Juntong Wang, Xiongkuo Min
2024-11-26
Abstract: The rapid advancement of large multimodal models (LMMs) has led to the rapid expansion of artificial intelligence generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in accurately assessing the perceptual quality of AIGVs due to the presence of unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, a systematic annotation pipeline including scoring and ranking processes is devised, which collects 370k expert ratings to date. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs, thereby accurately predicting precise video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods in terms of multiple perceptual quality dimensions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the following problem: current video quality assessment (VQA) models perform poorly when evaluating the perceptual quality of AI-generated videos (AIGVs), especially when faced with the distortions unique to AIGVs (such as unrealistic objects, unnatural movements, or inconsistent visual elements). Specifically:

1. **Limitations of traditional VQA models**:
   - Traditional VQA models are mainly designed for professionally generated content (PGC) and user-generated content (UGC), and they struggle with the distortions unique to AIGVs (such as spatial artifacts, temporal inconsistencies, and misalignments between text descriptions and generated content).
   - Existing evaluation metrics (such as Inception Score and Fréchet Video Distance) mainly focus on the overall distribution of videos and fail to reflect human preferences for individual videos.
   - Metrics built on vision-language pre-trained models (such as CLIPScore, BLIPScore, and AestheticScore) can evaluate the consistency between text and video, but they ignore the dynamic diversity and motion consistency of videos.
2. **Lack of large-scale AIGV datasets**:
   - No dataset contains enough AI-generated videos to evaluate their quality systematically, especially across multiple perceptual quality dimensions.

To solve these problems, the paper makes two main contributions:

1. **Constructing a large-scale AIGV dataset, AIGVQA-DB**:
   - It contains 36,576 AI-generated videos produced by 15 advanced text-to-video models from 1,048 diverse prompts.
   - Through a systematic annotation process, 370,000 expert ratings were collected, covering four dimensions: static quality, temporal smoothness, degree of dynamics, and text-video correspondence.
2. **Proposing a new VQA model, AIGV-Assessor**:
   - AIGV-Assessor is based on large multimodal models (LMMs) and uses spatiotemporal features to capture the complex quality attributes of AIGVs, thereby accurately predicting video quality scores and video-pair preferences.
   - The model can classify videos into quality levels through natural-language output and can also produce accurate quality scores through regression, which improves the interpretability and usability of VQA results.
   - It performs strongly in pairwise video comparisons, producing finer-grained evaluations that align more closely with human preferences.

Through these methods, the paper aims to develop more comprehensive and accurate metrics for evaluating the quality of AI-generated videos and to promote further development of the field.
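
The limitation noted above for vision-language metrics can be made concrete with a small sketch. It assumes, purely for illustration, that a text-video score is computed as the mean cosine similarity between a text embedding and each frame embedding. The embeddings are random stand-ins rather than outputs of CLIP or BLIP, and `cosine` and `frame_averaged_score` are hypothetical helpers, not functions from the paper. Under that assumption, shuffling the frames leaves the score unchanged, so such a metric cannot register temporal order or motion consistency.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frame_averaged_score(text_emb, frame_embs):
    """Mean text-frame cosine similarity; frame order plays no role."""
    sims = [cosine(text_emb, frame) for frame in frame_embs]
    return math.fsum(sims) / len(sims)


rng = random.Random(0)
dim, num_frames = 8, 16

# Stand-in embeddings: random vectors, NOT outputs of a real encoder.
text_emb = [rng.gauss(0, 1) for _ in range(dim)]
frame_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]

# Same frames in a scrambled temporal order.
shuffled = frame_embs[:]
rng.shuffle(shuffled)

original_score = frame_averaged_score(text_emb, frame_embs)
shuffled_score = frame_averaged_score(text_emb, shuffled)

# The two scores agree even though the temporal order differs.
print(math.isclose(original_score, shuffled_score))
```

A model that consumes spatiotemporal features, as the summary describes for AIGV-Assessor, is not bound by this invariance, because its input depends on how frames follow one another.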