VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation

Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, Wenhu Chen

2024-10-14

Abstract: Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metrics is able to provide reliable scores for generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores for 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that VideoScore has a consistently much higher correlation with human judges than other metrics. Given these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The main problem this paper addresses is the inadequacy of existing automatic evaluation metrics for video generation. Although text-to-video (T2V) generation models have made significant progress in recent years, the automatic metrics used to evaluate the videos they generate lag far behind. Existing metrics have the following issues:

1. **Distribution-Based Calculation**: Some metrics, such as FVD and IS, are computed over distributions of videos and cannot be applied directly to an individual model output.
2. **Single-Dimension Evaluation**: Most metrics evaluate only visual quality or text alignment and do not cover other important aspects such as motion smoothness and factual consistency.
3. **Lack of Fine-Grained Scoring**: Some metrics provide only an overall score and cannot offer detailed sub-scores across multiple aspects.
4. **Low Correlation with Human Judgment**: Some studies attempt to evaluate video quality with multimodal large language models (MLLMs), but these methods correlate poorly with human judgment.

To address these issues, the paper makes two main contributions:

1. **Constructing a Large-Scale Human-Annotated Dataset**: The authors create the **VideoFeedback** dataset, which contains 37.6K human-annotated synthetic videos from 11 existing T2V generation models.
2. **Training an Automatic Video Evaluation Model**: Using the **VideoFeedback** dataset, the authors train an automatic video quality evaluation model named **VideoScore**, which simulates human feedback and provides fine-grained multi-aspect scoring.

Through these contributions, the paper aims to provide a reliable automatic evaluation tool that helps researchers and developers better evaluate and improve T2V generation models.
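To make "correlation with human judgment" concrete, the following is a minimal sketch of how a Spearman rank correlation between a metric's scores and human ratings could be computed separately for each aspect. It is not the authors' evaluation code: the function names, the aspect keys, and the example numbers are all hypothetical and chosen only for illustration. A reported value such as 77.1 is presumably this coefficient scaled by 100.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def per_aspect_spearman(metric_scores, human_scores):
    """Compute one Spearman coefficient per aspect (hypothetical layout:
    dict mapping aspect name -> list of per-video scores)."""
    return {
        aspect: spearman(metric_scores[aspect], human_scores[aspect])
        for aspect in human_scores
    }


if __name__ == "__main__":
    # Made-up scores for four videos; aspect names are illustrative only.
    human = {"visual_quality": [1, 2, 3, 4], "text_alignment": [4, 3, 2, 1]}
    metric = {"visual_quality": [1.2, 2.5, 2.9, 3.8], "text_alignment": [3.1, 3.6, 2.0, 1.4]}
    for aspect, rho in per_aspect_spearman(metric, human).items():
        print(aspect, round(100 * rho, 1))
```

Reporting one coefficient per aspect, rather than a single pooled value, mirrors the fine-grained scoring idea above: a metric can rank videos well on visual quality while ranking them poorly on text alignment. In practice a library routine such as `scipy.stats.spearmanr` would usually replace the hand-written functions.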

