Abstract: Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing a universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive **Phy**sics **Gen**eration **Ben**chmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which together comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure that uses advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense that align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the failure of current Text-to-Video (T2V) generation models to produce videos that conform to physical common sense. Specifically, the paper points out:

1. **Understanding of physical common sense**: Although existing T2V models such as Sora have made significant progress in visualizing complex prompts, they still fall short in accurately representing intuitive physical phenomena. Cognitive psychologists hold that understanding intuitive physics is the basis for constructing a general-purpose world simulator, and existing models still have a large gap in this regard.
2. **Lack of evaluation framework**: Current T2V evaluation benchmarks mainly focus on the quality of generated videos (such as motion smoothness and background consistency) or on spatial relationships, and do not effectively evaluate whether the generated videos conform to basic physical laws. As a result, the models' understanding of physical common sense has not been adequately evaluated.

To fill this gap, the paper makes two main contributions:

1. **PhyGenBench**: A comprehensive physics generation benchmark containing 160 carefully designed prompts that cover 27 different physical laws across four basic domains: mechanics, optics, thermodynamics, and material properties. These prompts are designed to evaluate the models' understanding of physical common sense.
2. **PhyGenEval**: A novel evaluation framework that uses a hierarchical structure together with advanced Vision-Language Models (VLMs) and Large Language Models (LLMs) to evaluate the physical common sense in generated videos. PhyGenEval enables large-scale automated evaluation, and its results are highly consistent with human feedback.

### Specific problems

- **Accurate generation of physical phenomena**: Current models still have difficulty generating certain simple physical phenomena. For example, a stone should sink in water instead of floating. These failures reveal a significant gap in the models' understanding of physical common sense.
- **Limitations of evaluation methods**: Traditional evaluation metrics (such as FVD) have limitations in detecting unreasonable motions and require reference videos, which are difficult to obtain in new scenarios. PhyGenEval overcomes these limitations with a multi-stage evaluation strategy that gradually verifies the correctness of physical common sense, from single-frame images to complete videos.

### Conclusion

By proposing PhyGenBench and PhyGenEval, the paper provides a comprehensive and automated solution for evaluating how well T2V models understand physical common sense. Experimental results show that even the most advanced models (such as Gen-3) are far from meeting the requirements of a world simulator in this respect. The authors hope that this research will inspire the community to pay more attention to physical common-sense learning in these models, rather than simply regarding them as entertainment tools.
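The multi-stage evaluation strategy described above (checking single-frame images first, then the complete video) can be illustrated with a small sketch. This is not PhyGenEval's actual implementation: the class name `StagedEvaluator`, the stage names, the scorer signature, and the plain averaging rule are all assumptions made for illustration, and the stub scorers stand in for what would in practice be VLM or LLM calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict

# Hypothetical scorer signature: (prompt, video_path) -> score in [0, 1].
# In a real system each scorer would query a vision-language model.
StageScorer = Callable[[str, str], float]


@dataclass
class StagedEvaluator:
    """Runs each evaluation stage in order and averages the stage scores.

    The averaging rule is an assumption; the source only states that the
    evaluation proceeds from single frames to the complete video.
    """
    stages: Dict[str, StageScorer]

    def evaluate(self, prompt: str, video_path: str) -> Dict[str, float]:
        # Score every stage independently, preserving insertion order.
        scores = {
            name: scorer(prompt, video_path)
            for name, scorer in self.stages.items()
        }
        # Combine the per-stage scores into one overall score.
        scores["overall"] = mean(scores.values())
        return scores


# Stub scorers (placeholders, not real model calls).
def single_frame_stub(prompt: str, video_path: str) -> float:
    return 1.0  # e.g. the key frame shows the stone below the water surface


def full_video_stub(prompt: str, video_path: str) -> float:
    return 0.0  # e.g. the motion over time does not show sinking


evaluator = StagedEvaluator(
    stages={"single_frame": single_frame_stub, "full_video": full_video_stub}
)
result = evaluator.evaluate("A stone is dropped into water.", "video.mp4")
print(result)
```

Keeping each stage as a separate callable means a stage can be swapped for a different model without touching the aggregation logic, and the per-stage scores remain available for error analysis.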

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

VideoPhy: Evaluating Physical Commonsense for Video Generation

PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

How Far is Video Generation from World Model: A Physical Law Perspective

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

WorldSimBench: Towards Video Generation Models as World Simulators

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

VBench: Comprehensive Benchmark Suite for Video Generative Models

Sora Generates Videos with Stunning Geometrical Consistency

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

Teaching Video Diffusion Model with Latent Physical Phenomenon Knowledge

TPA-Net: Generate A Dataset for Text to Physics-based Animation

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

The Dawn of Video Generation: Preliminary Explorations with SORA-like Models

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation