Physical Property Understanding from Language-Embedded Feature Fields

Albert J. Zhai, Yuan Shen, Emily Y. Chen, Gloria X. Wang, Xinlei Wang, Sheng Wang, Kaiyu Guan, Shenlong Wang
2024-04-06
Abstract: Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper, we present a novel approach for dense prediction of the physical properties of objects using a collection of images. Inspired by how humans reason about physics through vision, we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate, annotation-free, and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks, such as estimating the mass of common objects, as well as other properties like friction and hardness.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper proposes NeRF2Physics, a method for densely predicting the physical properties of objects from a collection of images without annotated data. It is inspired by the human ability to identify materials and estimate their physical properties from visual information alone. The method combines language-embedded feature fields with material proposals from a large language model to produce zero-shot estimates of physical properties at 3D surface points, then propagates these estimates throughout the object by spatial interpolation. Specifically, it first uses a neural radiance field to extract a 3D point cloud of the object's surface and fuses 2D vision-language features into it. A large language model then proposes candidate materials for each object. Finally, the physical properties of each point are estimated through CLIP-based zero-shot kernel regression. The method is applicable to any object in the open world and is demonstrated on tasks such as estimating the mass, friction, and hardness of common objects. Experiments show that NeRF2Physics outperforms other zero-shot and supervised baselines on mass estimation. The paper also presents visualizations of the predicted physical property fields, showing that the method produces reasonable predictions of various physical properties without annotations. The related work section covers visual physics reasoning and language grounding models, noting that CLIP, a powerful vision-language model, has been widely applied to zero-shot and few-shot tasks. In summary, the paper addresses the problem of enabling computers to understand the physical properties of objects, such as mass, friction, and hardness, from visual information alone and without annotated training data.
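To make the per-point kernel regression step more concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes the kernel is a softmax over cosine similarities between each point's feature and the text embedding of each candidate material, which is one common form of kernel regression. The function names, the `temperature` parameter, the toy embeddings, and the density values are all hypothetical and stand in for real CLIP features and LLM-proposed values.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def kernel_regress(point_feats, material_feats, material_values, temperature=0.1):
    """Predict one property value per point as a weighted average of the
    candidate materials' values.

    point_feats:     (N, D) per-point features (assumed to live in the same
                     embedding space as the material text embeddings)
    material_feats:  (K, D) embeddings of the K candidate materials
    material_values: (K,)   property value proposed for each material
    temperature:     hypothetical sharpness parameter for the softmax kernel
    """
    p = l2_normalize(np.asarray(point_feats, dtype=float))
    m = l2_normalize(np.asarray(material_feats, dtype=float))
    sims = p @ m.T                                  # (N, K) cosine similarities
    logits = sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    preds = weights @ np.asarray(material_values, dtype=float)
    return preds, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder embeddings; a real pipeline would use vision-language features.
    material_feats = rng.normal(size=(3, 16))
    # Illustrative densities in kg/m^3 for three hypothetical candidate materials.
    material_values = np.array([600.0, 1200.0, 2700.0])
    # Toy points: noisy copies of the first and third material embeddings.
    point_feats = material_feats[[0, 2]] + 0.05 * rng.normal(size=(2, 16))
    preds, weights = kernel_regress(point_feats, material_feats, material_values)
    print(preds)
    print(weights.round(3))
```

Because the weights for each point are non-negative and sum to one, every prediction stays within the range of the candidate values. A lower temperature pushes each point toward its single most similar material, while a higher one blends the candidates more evenly.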