PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation

Shaowei Liu, Zhongzheng Ren, Saurabh Gupta, Shenlong Wang
2024-09-28
Abstract: We present PhysGen, a novel image-to-video generation method that converts a single image and an input condition (e.g., force and torque applied to an object in the image) to produce a realistic, physically plausible, and temporally consistent video. Our key insight is to integrate model-based physical simulation with a data-driven video generation process, enabling plausible image-space dynamics. At the heart of our system are three core components: (i) an image understanding module that effectively captures the geometry, materials, and physical parameters of the image; (ii) an image-space dynamics simulation model that utilizes rigid-body physics and inferred parameters to simulate realistic behaviors; and (iii) an image-based rendering and refinement module that leverages generative video diffusion to produce realistic video footage featuring the simulated motion. The resulting videos are realistic in both physics and appearance and are even precisely controllable, showcasing superior results over existing data-driven image-to-video generation works through quantitative comparison and comprehensive user study. PhysGen's resulting videos can be used for various downstream applications, such as turning an image into a realistic animation or allowing users to interact with the image and create various dynamics. Project page: https://stevenlsw.github.io/physgen/
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper attempts to address the problem of generating realistic and physically plausible videos from a single image. Specifically, the authors propose a new method called **PhysGen**, which can transform a static image and an input condition (such as forces and torques applied to objects in the image) into a realistic, physically plausible, and temporally coherent video. The main contributions of the paper are:

1. **Physical Parameter Inference**: Inferring the geometry, material, and physical parameters of objects from a single image.
2. **Physics-Based Dynamics Simulation**: Using rigid body physics and inferred parameters to simulate realistic object movements and interactions.
3. **Generative Rendering and Refinement**: Combining generative video diffusion models to produce realistic and physically plausible videos.

### Main Issues and Challenges

Existing image-to-video generation methods have the following issues:

- **Lack of Physical Realism**: Current data-driven methods often generate videos that lack temporal coherence and realistic object movements.
- **Lack of Controllability**: These methods cannot precisely control object movements, such as the effects of different forces and torques on objects.
- **Dependence on Large Training Data**: Existing generative models require a large amount of training data, which may not be feasible in practical applications.

### Solution

**PhysGen** addresses the above issues through the following three core components:

1. **Image Understanding Module**: Effectively captures the geometry, material, and physical parameters of objects from the input image.
2. **Image-Space Dynamics Simulation Model**: Uses rigid body physics and inferred parameters to simulate realistic behaviors.
3. **Image-Based Rendering and Refinement Module**: Utilizes generative video diffusion models to produce realistic and physically plausible videos.
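
To make the dynamics-simulation idea concrete, here is a minimal sketch of how a force and torque applied to a single 2D rigid body could be integrated into a per-frame trajectory. It is an illustration under stated assumptions, not PhysGen's simulator: the names `RigidBody2D`, `step`, and `simulate` are hypothetical, the force and torque are held constant, and collisions, friction, and gravity are left out. Integration uses semi-implicit Euler, which updates velocities first and then positions.

```python
from dataclasses import dataclass


@dataclass
class RigidBody2D:
    """Hypothetical 2D rigid body; all field names are illustrative."""
    mass: float           # total mass
    inertia: float        # moment of inertia about the center of mass
    x: float = 0.0        # center-of-mass position
    y: float = 0.0
    angle: float = 0.0    # orientation in radians
    vx: float = 0.0       # linear velocity
    vy: float = 0.0
    omega: float = 0.0    # angular velocity


def step(body, force, torque, dt):
    """Advance the body by one time step with semi-implicit Euler."""
    fx, fy = force
    # Newton's second law: linear acceleration = F / m, angular = torque / I.
    body.vx += fx / body.mass * dt
    body.vy += fy / body.mass * dt
    body.omega += torque / body.inertia * dt
    # Positions are updated with the already-updated velocities.
    body.x += body.vx * dt
    body.y += body.vy * dt
    body.angle += body.omega * dt


def simulate(body, force, torque, dt, steps):
    """Return the list of (x, y, angle) poses, one per simulated frame."""
    trajectory = []
    for _ in range(steps):
        step(body, force, torque, dt)
        trajectory.append((body.x, body.y, body.angle))
    return trajectory


if __name__ == "__main__":
    body = RigidBody2D(mass=2.0, inertia=1.0)
    for pose in simulate(body, force=(2.0, 0.0), torque=1.0, dt=0.5, steps=2):
        print(pose)
```

In a pipeline like the one summarized above, per-frame poses of this kind are what a rendering stage would consume to move the segmented object in image space. The mass and inertia would come from the inferred physical parameters rather than being set by hand.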
### Experimental Results

The authors evaluated the generative capabilities of **PhysGen** on multiple data sources, including internet data and self-captured indoor images. Experimental results show that **PhysGen** ranks first in user evaluations of physical realism and photorealism, and also performs well in quantitative evaluations, generating videos with low image FID and motion FID.

### Conclusion

**PhysGen** combines learning-based generative methods with traditional model-based physical simulation, enabling the generation of realistic and physically plausible videos without any training. This approach brings new breakthroughs to the field of image-to-video generation, especially in applications requiring physical realism, such as scientific discovery and robotics.