Physical Property Understanding from Language-Embedded Feature Fields

Albert J. Zhai, Yuan Shen, Emily Y. Chen, Gloria X. Wang, Xinlei Wang, Sheng Wang, Kaiyu Guan, Shenlong Wang

2024-04-06

Abstract: Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper, we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision, we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate, annotation-free, and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks, such as estimating the mass of common objects, as well as other properties like friction and hardness.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper proposes NeRF2Physics, a method for densely predicting the physical properties of objects from a collection of images without any annotated training data. It is motivated by the human ability to identify materials and estimate their physical properties from visual appearance alone. The method combines language-embedded feature fields with material proposals from a large language model to produce zero-shot estimates of physical properties at 3D surface points, then propagates those estimates throughout the object by spatial interpolation.

Specifically, the method first uses a neural radiance field to extract a 3D point cloud of the object's surface and fuses 2D vision-language features into it. A large language model then proposes candidate materials for each object. Finally, the physical properties of each point are estimated by CLIP-based zero-shot kernel regression over those candidates. The method applies to any object in the open world and is demonstrated on estimating the mass, friction, and hardness of common objects.

Experiments show that NeRF2Physics outperforms other zero-shot and supervised baselines on mass estimation. The paper also visualizes the predicted physical property fields, which show that the method yields plausible predictions for several physical properties without supervision. The related work section covers visual physics reasoning and language grounding models, noting that CLIP, a vision-language model, has been widely applied to zero-shot and few-shot tasks. In summary, the paper addresses how computers can infer physical properties such as mass, friction, and hardness from visual information alone, without annotated supervision.
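The per-point kernel regression step described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a softmax kernel over cosine similarities between each point's feature and the text embedding of each candidate material. The function name `kernel_regress`, the temperature of 0.1, the toy 3-dimensional "embeddings", and the density values are all hypothetical placeholders; a real pipeline would use CLIP features fused into the point cloud and property values proposed by the language model.

```python
import numpy as np

def kernel_regress(point_feats, material_feats, material_values, temperature=0.1):
    """Estimate a property per point as a similarity-weighted average
    of the candidate materials' property values (assumed softmax kernel)."""
    # Normalize rows so that dot products are cosine similarities.
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    m = material_feats / np.linalg.norm(material_feats, axis=1, keepdims=True)
    sims = p @ m.T                                  # (num_points, num_materials)
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)     # for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # each row sums to 1
    return weights @ material_values                # (num_points,)

# Toy stand-ins for text embeddings of three candidate materials
# (hypothetical; real embeddings would come from a CLIP text encoder).
material_feats = np.eye(3)
# Illustrative densities in kg/m^3 for the three candidates (assumed values).
material_values = np.array([7800.0, 700.0, 950.0])

# Toy stand-ins for per-point features: two points that match one material
# exactly and one point halfway between the first two materials.
point_feats = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])

estimates = kernel_regress(point_feats, material_feats, material_values)
print(estimates)
```

Because the weights are a convex combination, every estimate lies between the smallest and largest candidate value. A lower temperature pushes each point toward its single best-matching material, while a higher one blends the candidates more evenly.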

GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs

Can Language Models Understand Physical Concepts?

Visualizing the Obvious: A Concreteness-based Ensemble Model for Noun Property Prediction

Identifying Terrain Physical Parameters from Vision -- Towards Physical-Parameter-Aware Locomotion and Navigation

Probing the Link Between Vision and Language in Material Perception Using Psychophysics and Unsupervised Learning

Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models

You've Got to Feel It To Believe It: Multi-Modal Bayesian Inference for Semantic and Property Prediction

A Vision-Based Two-Stage Framework for Inferring Physical Properties of the Terrain

Predictive Visuo-Tactile Interactive Perception Framework for Object Properties Inference

Octopi: Object Property Reasoning with Large Tactile-Language Models

Compositional Physical Reasoning of Objects and Events from Videos

A General Protocol to Probe Large Vision Models for 3D Physical Understanding

Gaussian-Informed Continuum for Physical Property Identification and Simulation

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Physically Grounded Vision-Language Models for Robotic Manipulation

Recognizing Material Properties from Images

Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images

VIPHY: Probing "Visible" Physical Commonsense Knowledge

GIC: Gaussian-Informed Continuum for Physical Property Identification and Simulation