Abstract: Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. VBench has several appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g., subject identity inconsistency, motion smoothness, temporal flickering, and spatial relationship). The evaluation metrics with fine-grained levels reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception, for each evaluation dimension respectively. 3) Valuable Insights: We look into current models' ability across various evaluation dimensions and various content types. We also investigate the gaps between video and image generation models. 4) Versatile Benchmarking: VBench++ supports evaluating text-to-video and image-to-video. We introduce a high-quality Image Suite with an adaptive aspect ratio to enable fair evaluations across different image-to-video generation settings. Beyond assessing technical quality, VBench++ evaluates the trustworthiness of video generative models, providing a more holistic view of model performance. 5) Full Open-Sourcing: We fully open-source VBench++ and continually add new video generation models to our leaderboard to drive forward the field of video generation.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in the evaluation of video generation models:

1. **Mismatch between existing evaluation metrics and human perception**: Existing evaluation metrics (such as Inception Score, Fréchet Inception Distance, Fréchet Video Distance, and CLIPSIM) often fail to fully reflect human perception of video generation quality. As a result, evaluation results may deviate from the actual user experience.
2. **Lack of a comprehensive evaluation framework**: There is currently no framework that evaluates the performance of video generation models both comprehensively and at a fine-grained level. Ideally, an evaluation system should reveal the specific strengths and weaknesses of each model, thereby providing guidance for the development of future models.
3. **Need for multi-dimensional evaluation**: The quality of video generation is a complex, multi-faceted concept, and the video attributes that people care about differ across application contexts. A multi-dimensional evaluation framework is therefore required to measure video generation quality.
4. **Challenges in evaluating image-to-video generation models**: The evaluation of image-to-video (I2V) generation models faces additional challenges. For example, the choice of input images and the configured video resolution have a significant impact on the generation results. A high-quality image suite that adapts to different settings is required to evaluate these models fairly.
5. **Trustworthiness of the model**: In addition to technical performance, video generation models should ensure that the content they generate is fair across cultures and demographics, and should avoid generating harmful or offensive content. This is especially important when models are applied in fields such as social media broadcasting and education.

To this end, the paper proposes **VBench++**, a comprehensive and versatile benchmark suite for evaluating the performance of video generation models. The main features of VBench++ include:

- **Comprehensive evaluation dimensions**: VBench++ decomposes "video generation quality" into 16 specific, hierarchical, and disentangled dimensions, each with tailored prompts and an evaluation method.
- **Alignment with human perception**: Human preference annotations are collected to verify, for each evaluation dimension, that the benchmark's results agree with human perception.
- **Valuable insights**: Analyzing the performance of current models on each evaluation dimension provides detailed feedback on model capabilities and guides future model training and architecture selection.
- **Versatile benchmarking**: VBench++ supports multiple video generation tasks, including text-to-video and image-to-video, and provides a high-quality image suite that adapts to different settings.
- **Fully open-source**: All components of VBench++, including the prompt suite, image suite, evaluation methods, generated videos, and human preference annotation data, are fully open-source.

Through these features, VBench++ aims to provide a more comprehensive, fine-grained, and human-aligned framework for the evaluation of video generation models.
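As an illustration of the human-alignment check described above, the following is a minimal sketch of one common way to compare an automatic metric with human preferences on a single dimension: convert both into per-model win ratios over pairwise comparisons, then compute a rank correlation. This is not taken from the VBench++ code base. The model names, scores, and vote fractions are made up, the helper names (`win_ratios`, `spearman`) are hypothetical, and the use of Spearman correlation is an assumption of this sketch.

```python
from itertools import combinations


def win_ratios(models, prefer):
    """Average pairwise win rate per model.

    prefer(a, b) returns a value in [0, 1]: the share of the comparison
    won by a (1.0 = a wins, 0.0 = b wins, 0.5 = tie).
    """
    wins = {m: 0.0 for m in models}
    for a, b in combinations(models, 2):
        share = prefer(a, b)
        wins[a] += share
        wins[b] += 1.0 - share
    opponents = len(models) - 1
    return {m: wins[m] / opponents for m in models}


def ranks(values):
    """1-based ascending ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data for one evaluation dimension (illustrative only).
models = ["model_a", "model_b", "model_c", "model_d"]
metric_scores = {"model_a": 0.91, "model_b": 0.84, "model_c": 0.78, "model_d": 0.80}
# Fraction of annotators who preferred the first model of each pair.
human_votes = {
    ("model_a", "model_b"): 0.70,
    ("model_a", "model_c"): 0.90,
    ("model_a", "model_d"): 0.80,
    ("model_b", "model_c"): 0.60,
    ("model_b", "model_d"): 0.65,
    ("model_c", "model_d"): 0.55,
}


def metric_prefer(a, b):
    """The metric prefers whichever model has the higher score."""
    if metric_scores[a] == metric_scores[b]:
        return 0.5
    return 1.0 if metric_scores[a] > metric_scores[b] else 0.0


def human_prefer(a, b):
    """Look up the annotated vote fraction for the pair (a, b)."""
    return human_votes[(a, b)]


metric_wr = win_ratios(models, metric_prefer)
human_wr = win_ratios(models, human_prefer)
rho = spearman([metric_wr[m] for m in models], [human_wr[m] for m in models])
print(f"Spearman correlation between metric and human win ratios: {rho:.2f}")
```

A correlation close to 1 would suggest that the metric ranks models on that dimension much as annotators do. Repeating the computation per dimension gives a per-dimension view of alignment.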

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
