Abstract: We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., force and torque applied to an object in the image) into a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are precisely controllable, showing superior results over existing data-driven image-to-video generation works in a quantitative comparison and a comprehensive user study. PhysGen's resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Project page: https://stevenlsw.github.io/physgen/

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper attempts to address the problem of generating realistic and physically plausible videos from a single image. Specifically, the authors propose a new method called **PhysGen**, which can transform a static image and an input condition (such as forces and torques applied to objects in the image) into a realistic, physically plausible, and temporally coherent video. The main contributions of the paper are:

1. **Physical Parameter Inference**: Inferring the geometry, material, and physical parameters of objects from a single image.
2. **Physics-Based Dynamics Simulation**: Using rigid body physics and inferred parameters to simulate realistic object movements and interactions.
3. **Generative Rendering and Refinement**: Combining generative video diffusion models to produce realistic and physically plausible videos.

### Main Issues and Challenges

Existing image-to-video generation methods have the following issues:

- **Lack of Physical Realism**: Current data-driven methods often generate videos that lack temporal coherence and realistic object movements.
- **Lack of Controllability**: These methods cannot precisely control object movements, such as the effects of different forces and torques on objects.
- **Dependence on Large Training Data**: Existing generative models require a large amount of training data, which may not be feasible in practical applications.

### Solution

**PhysGen** addresses the above issues through the following three core components:

1. **Image Understanding Module**: Effectively captures the geometry, material, and physical parameters of objects from the input image.
2. **Image-Space Dynamics Simulation Model**: Uses rigid body physics and inferred parameters to simulate realistic behaviors.
3. **Image-Based Rendering and Refinement Module**: Utilizes generative video diffusion models to produce realistic and physically plausible videos.
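To make the second component more concrete, here is a minimal sketch of how a force and torque input might drive a single 2D rigid body over time. It is an illustration under stated assumptions, not PhysGen's implementation: the names `RigidBody2D`, `step`, and `simulate` are hypothetical, the mass and inertia values stand in for parameters the image understanding module would infer, and collisions, friction, and gravity are deliberately left out. Integration uses semi-implicit Euler, a common choice in rigid-body simulators.

```python
from dataclasses import dataclass


@dataclass
class RigidBody2D:
    """Hypothetical 2D rigid body; mass and inertia are assumed to be given."""
    mass: float      # kg
    inertia: float   # moment of inertia about the center of mass, kg*m^2
    x: float = 0.0
    y: float = 0.0
    angle: float = 0.0   # radians
    vx: float = 0.0
    vy: float = 0.0
    omega: float = 0.0   # angular velocity, rad/s


def step(body, force, torque, dt):
    """Advance the body by one time step with semi-implicit Euler."""
    fx, fy = force
    # Update velocities first (a = F / m, alpha = tau / I) ...
    body.vx += fx / body.mass * dt
    body.vy += fy / body.mass * dt
    body.omega += torque / body.inertia * dt
    # ... then update the pose using the new velocities.
    body.x += body.vx * dt
    body.y += body.vy * dt
    body.angle += body.omega * dt


def simulate(body, force, torque, push_time, total_time, dt=0.01):
    """Apply a constant force and torque for push_time seconds, then let the
    body coast. Returns one (x, y, angle) pose per step; a renderer could
    warp the object's pixels to each pose."""
    push_steps = round(push_time / dt)
    total_steps = round(total_time / dt)
    poses = []
    for i in range(total_steps):
        if i < push_steps:
            step(body, force, torque, dt)
        else:
            step(body, (0.0, 0.0), 0.0, dt)
        poses.append((body.x, body.y, body.angle))
    return poses


# Example: push a 2 kg object to the right for half a second.
body = RigidBody2D(mass=2.0, inertia=0.5)
trajectory = simulate(body, force=(4.0, 0.0), torque=1.0,
                      push_time=0.5, total_time=1.0)
```

The per-step poses are the kind of image-space motion that a rendering and refinement stage would then turn into video frames. A full simulator would also need contact handling between objects and scene boundaries.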
### Experimental Results

The authors evaluated the generative capabilities of **PhysGen** on multiple data sources, including internet data and self-captured indoor images. Experimental results show that **PhysGen** ranks first in user evaluations of physical realism and photorealism, and also performs well in quantitative evaluations, generating videos with low image FID and motion FID.

### Conclusion

**PhysGen** combines learning-based generative methods with traditional model-based physical simulation, enabling the generation of realistic and physically plausible videos without any training. This approach brings new breakthroughs to the field of image-to-video generation, especially in applications requiring physical realism, such as scientific discovery and robotics.
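For readers unfamiliar with the FID metric cited in the experimental results, it is conventionally defined as the Fréchet distance between two Gaussians fitted to feature embeddings of generated and reference samples. The summary does not state which features underlie the image FID and motion FID, so the symbols here are generic: $\mu_g, \Sigma_g$ and $\mu_r, \Sigma_r$ denote the feature mean and covariance of the generated and reference sets.

$$
\mathrm{FID} = \lVert \mu_g - \mu_r \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2}\right)
$$

Lower values mean the two feature distributions are closer, which is why low scores count as a good result.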

PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

PhysMotion: Physics-Grounded Dynamics From a Single Image

Phys4DGen: A Physics-Driven Framework for Controllable and Efficient 4D Content Generation from a Single Image

Phy124: Fast Physics-Driven 4D Content Generation from a Single Image

Physics-based Human Motion Estimation and Synthesis from Videos

PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

VideoPhy: Evaluating Physical Commonsense for Video Generation

PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics

How Far is Video Generation from World Model: A Physical Law Perspective

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

MotionCraft: Physics-based Zero-Shot Video Generation

Sync4D: Video Guided Controllable Dynamics for Physics-Based 4D Generation

Motion Prompting: Controlling Video Generation with Motion Trajectories

4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency

3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes

Sora Generates Videos with Stunning Geometrical Consistency

SmPhy: Generating Smooth and Physically Plausible 3D Garment Animations

GenDeF: Learning Generative Deformation Field for Video Generation

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models