Abstract: The rapid advancement of large multimodal models (LMMs) has driven a surge in artificial intelligence-generated videos (AIGVs), which highlights the pressing need for effective video quality assessment (VQA) models designed specifically for AIGVs. Current VQA models generally fall short in assessing the perceptual quality of AIGVs because of their unique distortions, such as unrealistic objects, unnatural movements, or inconsistent visual elements. To address this challenge, we first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 diverse prompts. With these AIGVs, a systematic annotation pipeline including scoring and ranking processes is devised, which has collected 370k expert ratings to date. Based on AIGVQA-DB, we further introduce AIGV-Assessor, a novel VQA model that leverages spatiotemporal features and LMM frameworks to capture the intricate quality attributes of AIGVs, thereby accurately predicting video quality scores and video pair preferences. Through comprehensive experiments on both AIGVQA-DB and existing AIGV databases, AIGV-Assessor demonstrates state-of-the-art performance, significantly surpassing existing scoring or evaluation methods across multiple perceptual quality dimensions.

What problem does this paper attempt to address?

The paper addresses the following problem: current video quality assessment (VQA) models perform poorly when evaluating the perceptual quality of AI-generated videos (AIGVs), especially when dealing with the distortions unique to AIGVs (such as unrealistic objects, unnatural movements, or inconsistent visual elements). Specifically:

1. **Limitations of traditional VQA models**:
   - Traditional VQA models are designed mainly for professionally-generated content (PGC) and user-generated content (UGC), and they struggle with the distortions unique to AIGVs (such as spatial artifacts, temporal inconsistencies, and misalignments between text descriptions and generated content).
   - Existing evaluation metrics (such as Inception Score and Fréchet Video Distance) mainly focus on the overall distribution of videos and fail to reflect human preferences for individual videos.
   - Metrics built on vision-language pre-trained models (such as CLIPScore, BLIPScore, and AestheticScore) can evaluate the consistency between text and video, but they ignore the dynamic diversity and motion consistency of videos.
2. **Lack of large-scale AIGV datasets**:
   - No dataset contains enough AI-generated videos to evaluate their quality systematically, especially across multiple perceptual quality dimensions.

To solve these problems, the paper makes two main contributions:

1. **Constructing a large-scale AIGV dataset, AIGVQA-DB**:
   - It contains 36,576 AI-generated videos produced by 15 advanced text-to-video models using 1,048 diverse prompts.
   - Through a systematic annotation process, 370,000 expert ratings were collected, covering four dimensions: static quality, temporal smoothness, degree of dynamics, and text-video correspondence.
2. **Proposing a new VQA model, AIGV-Assessor**:
   - AIGV-Assessor is built on large multimodal models (LMMs) and uses spatiotemporal features to capture the complex quality attributes of AIGVs, thereby accurately predicting video quality scores and video-pair preferences.
   - The model can classify videos into different quality levels through natural-language output and can also generate quality scores through regression, which improves the interpretability and usability of VQA results.
   - It performs well in pairwise video comparisons and can produce finer-grained evaluations that are closer to human preferences.

Through these contributions, the paper aims to develop more comprehensive and accurate metrics for evaluating the quality of AI-generated videos and to promote further development of this field.
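The relationship between the two outputs mentioned above (per-video quality scores and video-pair preferences) can be illustrated with a minimal sketch. This is not the paper's method or evaluation protocol: the function names, the tie-breaking rule, and the toy data are all hypothetical, and the sketch only shows one simple way a per-video score could be turned into a pair preference and compared against human choices.

```python
from typing import Dict, List, Tuple

# Hypothetical record type: (video_a, video_b, id of the video humans preferred).
HumanPreference = Tuple[str, str, str]


def preference_from_scores(scores: Dict[str, float], a: str, b: str) -> str:
    """Return the id of the video with the higher predicted score.

    Assumption for this sketch: ties are resolved in favour of the first video.
    """
    return a if scores[a] >= scores[b] else b


def pairwise_accuracy(scores: Dict[str, float],
                      human_prefs: List[HumanPreference]) -> float:
    """Fraction of pairs where the score-derived preference matches the human one."""
    if not human_prefs:
        raise ValueError("human_prefs must contain at least one pair")
    matches = sum(
        1 for a, b, preferred in human_prefs
        if preference_from_scores(scores, a, b) == preferred
    )
    return matches / len(human_prefs)


# Toy data, invented purely for illustration.
predicted_scores = {"v1": 3.2, "v2": 4.1, "v3": 2.0}
human_choices = [
    ("v1", "v2", "v2"),  # model agrees: v2 scores higher
    ("v1", "v3", "v1"),  # model agrees: v1 scores higher
    ("v2", "v3", "v3"),  # model disagrees: it would pick v2
]
accuracy = pairwise_accuracy(predicted_scores, human_choices)
```

With this toy data the score-derived preferences match the human choice in two of the three pairs. A real evaluation would use the dataset's own annotations and whatever pairing protocol the authors define.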

AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM

AIGC-VQA: A Holistic Perception Metric for AIGC Video Quality Assessment

Advancing Video Quality Assessment for AIGC

Exploring AIGC Video Quality: A Focus on Visual Harmony, Video-Text Consistency and Domain Distribution Gap

Video Quality Assessment: A Comprehensive Survey

A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

A Perceptual Quality Assessment Exploration for AIGC Images

Audio-Visual Quality Assessment for User Generated Content: Database and Method

Subjective and Objective Audio-Visual Quality Assessment for User Generated Content

VQA$^2$: Visual Question Answering for Video Quality Assessment

Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Perceptual Quality Assessment of Internet Videos

Evaluating Text-to-Visual Generation with Image-to-Text Generation

UGC-VQA: Benchmarking Blind Video Quality Assessment for User Generated Content

AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment

XGC-VQA: A unified video quality assessment model for User, Professionally, and Occupationally-Generated Content

Human-Activity AGV Quality Assessment: A Benchmark Dataset and an Objective Evaluation Metric

A Survey of AI-Generated Video Evaluation