GAIA: Rethinking Action Quality Assessment for AI-Generated Videos

Zijian Chen, Wei Sun, Yuan Tian, Jun Jia, Zicheng Zhang, Jiarui Wang, Ru Huang, Xiongkuo Min, Guangtao Zhai, Wenjun Zhang
2024-10-14
Abstract: Assessing action quality is both imperative and challenging due to its significant impact on the quality of AI-generated videos, further complicated by the inherently ambiguous nature of actions within AI-generated video (AIGV). Current action quality assessment (AQA) algorithms predominantly focus on actions from real specific scenarios and are pre-trained with normative action features, thus rendering them inapplicable in AIGVs. To address these problems, we construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective, resulting in 971,244 ratings among 9,180 video-action pairs. Based on GAIA, we evaluate a suite of popular text-to-video (T2V) models on their ability to generate visually rational actions, revealing their pros and cons on different categories of actions. We also extend GAIA as a testbed to benchmark the AQA capacity of existing automatic evaluation methods. Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods perform poorly with an average SRCC of 0.454, 0.191, and 0.519, respectively, indicating a sizable gap between current models and human action perception patterns in AIGVs. Our findings underscore the significance of action quality as a unique perspective for studying AIGVs and can catalyze progress towards methods with enhanced capacities for AQA in AIGVs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the poor performance of existing Action Quality Assessment (AQA) methods when they are used to evaluate the quality of actions in AI-generated videos (AIGVs). Specifically:

1. **Limitations of Existing AQA Datasets**:
   - Existing AQA datasets mainly cover actions in real videos from specific domains such as sports and fitness, and the collected scores are primarily coarse-grained professional ratings that do not account for the diversity of scenarios.
   - The content of these datasets varies little, because the subjects usually perform similar actions in consistent environments.
2. **Shortcomings of Existing AQA Methods**:
   - Existing AQA methods are mainly built on pose or visual feature extraction, aggregation, and score regression. They typically rely on powerful pre-trained 3D backbone networks to achieve better feature transferability.
   - However, generated videos may contain atypical actions, such as abnormal limb counts, illogical object shapes, and physically impossible movements, so models learned from real videos perform poorly on AIGVs.
3. **Special Challenges of AI-Generated Videos**:
   - Generated videos differ fundamentally from real videos, which makes the quality of their actions harder to evaluate.
   - With the exponential growth of text-to-video (T2V) models, the challenge of evaluating video action quality has become more severe and calls for reliable solutions.

To address these issues, the paper proposes GAIA (Generic AI-generated Action dataset), a large-scale subjective evaluation dataset that assesses the quality of actions in AI-generated videos from a causal reasoning-based perspective. GAIA contains 9,180 video-action pairs with a total of 971,244 human ratings, covering a variety of full-body, hand, and facial actions.
Using this dataset, the paper evaluates the ability of 18 popular T2V models to generate visually rational actions and reveals their strengths and weaknesses across different categories of actions. GAIA is also used as a testbed to benchmark how well existing automatic evaluation methods perform AQA. The results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods all perform poorly on AIGVs, with average SRCCs (Spearman rank correlation coefficients) of 0.454, 0.191, and 0.519, respectively, indicating a significant gap between current models and human action perception patterns.
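
The SRCC figures above measure how well a method's predicted scores preserve the rank order of the human ratings. As a minimal sketch of how such a coefficient can be computed, the Python below ranks both score lists (ties receive the average rank) and takes the Pearson correlation of the ranks. The function names and the toy scores are assumptions made for illustration; this is not the paper's evaluation code, and the numbers are not taken from GAIA.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def srcc(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(predicted) != len(human) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(predicted), average_ranks(human))


# Toy example (made-up numbers): five videos, human scores vs. a metric's scores.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.2, 0.1, 0.5, 0.4, 0.9]
print(round(srcc(metric_scores, human_scores), 3))  # 0.8
```

An SRCC of 1 means the metric orders the videos exactly as human raters do, 0 means no monotonic relationship, and negative values mean the order is inverted. Because only ranks matter, the metric's scores do not need to be on the same scale as the human ratings.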