Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo
2024-10-08
Abstract: Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing a universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which allows it to comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure that uses appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, and these assessments align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the shortcomings of current text-to-video (T2V) generation models in producing videos that conform to physical commonsense. Specifically, it points out:

1. **Understanding of physical commonsense**: Although existing T2V models such as Sora have made significant progress in visualizing complex prompts, they still fall short in accurately representing intuitive physical phenomena. Cognitive psychologists hold that understanding intuitive physics is the foundation for building a general-purpose world simulator, and existing models remain far from that goal.
2. **Lack of an evaluation framework**: Current T2V benchmarks mainly measure the quality of generated videos (such as motion smoothness and background consistency) or spatial relationships, and do not effectively check whether the generated videos obey basic physical laws. As a result, the models' understanding of physical commonsense has not been adequately evaluated.

To fill this gap, the paper makes two main contributions:

1. **PhyGenBench**: A comprehensive physics generation benchmark containing 160 carefully designed prompts that cover 27 physical laws across four domains: mechanics, optics, thermodynamics, and material properties. The prompts are designed to evaluate how well models understand physical commonsense.
2. **PhyGenEval**: A novel evaluation framework that uses a hierarchical structure together with advanced vision-language models (VLMs) and large language models (LLMs) to evaluate physical commonsense in generated videos. PhyGenEval enables large-scale automated evaluation, and its results are highly consistent with human feedback.

### Specific problems

- **Accurate generation of physical phenomena**: Current models still struggle to generate some simple physical phenomena. For example, a stone should sink in water rather than float. Such failures reveal a significant gap in the models' understanding of physical commonsense.
- **Limitations of evaluation methods**: Traditional evaluation metrics (such as FVD) are poor at detecting implausible motion and require reference videos, which are difficult to obtain for new scenarios. PhyGenEval overcomes these limitations with a multi-stage evaluation strategy that verifies physical commonsense step by step, from single-frame images to complete videos.

### Conclusion

By proposing PhyGenBench and PhyGenEval, the paper provides a comprehensive, automated way to evaluate how well T2V models understand physical commonsense. Experimental results show that even the most advanced models (such as Gen-3) are far from meeting the requirements of a world simulator in this respect. The authors hope that this research will encourage the community to pay more attention to physical commonsense learning in these models, rather than treating them simply as entertainment tools.
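
The multi-stage idea described above (checking single frames first, then frame order, then the whole clip) can be illustrated with a minimal Python sketch. This is not PhyGenEval's implementation: the stage names, the keyword-matching stubs, the use of text captions in place of video frames, and the plain averaging of stage scores are all assumptions made for illustration. A real pipeline would replace each stub with a call to a vision-language model.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-in: each "frame" is a text caption. A real pipeline
# would pass images to a vision-language model instead.
Frames = List[str]
Stage = Callable[[Frames], float]


def key_frame_stage(expected_state: str) -> Stage:
    """Single-frame check: does any frame show the expected end state?"""
    def score(frames: Frames) -> float:
        return 1.0 if any(expected_state in f for f in frames) else 0.0
    return score


def order_stage(expected_events: List[str]) -> Stage:
    """Multi-frame check: do the expected events occur in the given order?"""
    def score(frames: Frames) -> float:
        matched = 0
        for frame in frames:
            if matched < len(expected_events) and expected_events[matched] in frame:
                matched += 1
        return 1.0 if matched == len(expected_events) else 0.0
    return score


def whole_video_stage(forbidden_state: str) -> Stage:
    """Whole-video check: the clip must not end in a forbidden state."""
    def score(frames: Frames) -> float:
        return 0.0 if frames and forbidden_state in frames[-1] else 1.0
    return score


def evaluate(frames: Frames, stages: Dict[str, Stage]) -> Dict[str, float]:
    """Run every stage; report per-stage scores plus their plain average.

    Averaging is an assumption of this sketch, not the paper's rule.
    """
    scores = {name: stage(frames) for name, stage in stages.items()}
    scores["overall"] = mean(scores.values())
    return scores


# Stages for the prompt "a stone is dropped into water" (illustrative only).
STONE_STAGES: Dict[str, Stage] = {
    "key_frame": key_frame_stage("bottom"),
    "order": order_stage(["above", "bottom"]),
    "whole_video": whole_video_stage("floating"),
}

if __name__ == "__main__":
    sinking = ["stone above water", "stone entering water", "stone at bottom"]
    floating = ["stone above water", "stone floating on water"]
    print(evaluate(sinking, STONE_STAGES))
    print(evaluate(floating, STONE_STAGES))
```

Splitting the checks into stages makes it possible to see which level fails: a clip may show a plausible final frame while still ending in a physically wrong state, which a single overall score would hide.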