Abstract: Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing a universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, to comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure that uses advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense that align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench

PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

TPA-Net: Generate A Dataset for Text to Physics-based Animation

Searching Priors Makes Text-to-Video Synthesis Better

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

VideoPhy: Evaluating Physical Commonsense for Video Generation

BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

Text-Animator: Controllable Visual Text Video Generation

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

Motion Control for Enhanced Complex Action Video Generation

Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide