OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding

Yinan Deng, Jiahui Wang, Jingyu Zhao, Jianyu Dou, Yi Yang, Yufeng Yue
2024-06-12
Abstract: In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval. However, existing methods face some limitations: they either focus on learning point-wise features, resulting in blurry semantic understanding, or solely tackle object-level reconstruction, thereby overlooking the intricate details of the object's interior. To address these challenges, we introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object-level. Moreover, we incorporate part-level features into the neural fields, enabling a nuanced representation of object interiors. This approach captures object-level instances while maintaining a fine-grained understanding. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at multiple scales, including global movement and local manipulation.
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
The paper targets three limitations of existing open-vocabulary 3D scene reconstruction methods:

1. **Point-level feature learning**: Many existing methods learn point-wise features, which leads to blurry semantic understanding. Such a method may identify that an object occupies a certain region, yet it cannot describe the object's specific parts or internal structure in detail.
2. **Limitations of object-level reconstruction**: Other methods reconstruct only at the object level and ignore the details inside each object. They therefore perform poorly on tasks that require fine-grained operations, such as a robot grasping a specific part of an object.
3. **Lack of fine-grained understanding**: As a result, existing open-vocabulary mapping methods usually understand a scene only at the object level and cannot provide a more detailed understanding of internal structure, which matters most in manipulation tasks such as grasping.

To address these challenges, the paper proposes **OpenObj**, a method for constructing open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. The main goals of OpenObj are:

- **Establish a robust framework**: Achieve efficient and watertight scene modeling and understanding at the object level.
- **Integrate part-level features**: Incorporate part-level features into the neural fields to obtain a detailed representation of object interiors.
- **Support multi-scale tasks**: Support retrieval and navigation at the object level as well as the representation and manipulation of specific object parts.

With these improvements, OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks and supports real-world robotics tasks, including global movement and local manipulation.
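
To make the idea of retrieval at two scales concrete, here is a minimal sketch, assuming each object and each object part already carries a feature vector in the same embedding space as a text query. The summary above does not describe how OpenObj implements retrieval, so everything here is hypothetical: the function names `cosine_similarity` and `retrieve`, the two-stage object-then-part lookup, and the hand-made 3-dimensional vectors, which stand in for embeddings that a VLM would produce in practice.

```python
import numpy as np


def cosine_similarity(query, features):
    """Cosine similarity between one query vector and each row of `features`."""
    query = query / np.linalg.norm(query)
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    return features @ query


def retrieve(object_query, part_query, object_features, part_features):
    """Hypothetical two-stage lookup: best object first, then best part inside it.

    object_features: array of shape (num_objects, dim)
    part_features:   list with one (num_parts, dim) array per object
    """
    obj_idx = int(np.argmax(cosine_similarity(object_query, object_features)))
    part_idx = int(np.argmax(cosine_similarity(part_query, part_features[obj_idx])))
    return obj_idx, part_idx


# Toy data (assumed, not from the paper): object 0 is a chair, object 1 is a mug.
object_features = np.array([[1.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0]])
part_features = [
    np.array([[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]),  # chair: seat, leg
    np.array([[0.0, 0.9, 0.1], [0.0, 0.5, 0.9]]),  # mug: body, handle
]

object_query = np.array([0.1, 0.9, 0.0])  # stands in for the text "mug"
part_query = np.array([0.0, 0.3, 1.0])    # stands in for the text "handle"

print(retrieve(object_query, part_query, object_features, part_features))  # (1, 1)
```

The object-level match is the kind of result a navigation task would use (go to the mug), while the part-level match is what a manipulation task would use (grasp the handle).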