Abstract: In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval. However, existing methods face some limitations: they either focus on learning point-wise features, resulting in blurry semantic understanding, or solely tackle object-level reconstruction, thereby overlooking the intricate details of the object's interior. To address these challenges, we introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object-level. Moreover, we incorporate part-level features into the neural fields, enabling a nuanced representation of object interiors. This approach captures object-level instances while maintaining a fine-grained understanding. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at multiple scales, including global movement and local manipulation.

What problem does this paper attempt to address?

This paper addresses several key limitations of existing open-vocabulary 3D scene reconstruction methods:

1. **Learning of point-level features**: Existing methods often focus on learning point-level features, which leads to ambiguous semantic understanding. For example, such a method may detect that an object is present in a region yet be unable to describe the object's specific parts or internal structure.
2. **Limitations of object-level reconstruction**: Some methods focus only on object-level reconstruction and ignore the complex details inside the object. As a result, they perform poorly on tasks that require fine-grained operations, such as a robot grasping a specific part of an object.
3. **Lack of fine-grained understanding**: Existing open-vocabulary mapping methods can usually understand a scene only at the object level and cannot provide a more detailed understanding of internal structure, which matters most in tasks involving specific operations (such as grasping).

To address these challenges, the paper proposes **OpenObj**, a method for constructing an open-vocabulary object-level neural radiance field (NeRF) with fine-grained understanding. The main goals of OpenObj are:

- **Establish a robust framework**: Achieve efficient and watertight scene modeling and understanding at the object level.
- **Integrate part-level features**: Incorporate part-level features into the neural field to achieve a detailed representation of the interior of the object.
- **Support multi-scale tasks**: Support retrieval and navigation at the object level as well as the representation and manipulation of specific objects.

Through these improvements, OpenObj achieves superior performance in zero-shot semantic segmentation and retrieval tasks and supports practical robot tasks, including global motion and local manipulation.
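To make the idea of retrieval at two scales concrete, the following is a minimal sketch of coarse-to-fine open-vocabulary lookup. It is not OpenObj's implementation: the scene dictionary, the three-dimensional feature vectors, and the `retrieve` helper are all hypothetical stand-ins, and the query vector is hand-written rather than produced by a real VLM text encoder. It only illustrates the general pattern of first matching a query against object-level features and then against the part-level features of the selected object.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, scene):
    """Return (object, part) whose features best match the query.

    Coarse step: pick the object with the most similar object-level feature.
    Fine step: within that object, pick the most similar part-level feature.
    """
    best_obj = max(scene, key=lambda name: cosine(query, scene[name]["feature"]))
    parts = scene[best_obj]["parts"]
    best_part = max(parts, key=lambda name: cosine(query, parts[name]))
    return best_obj, best_part

# Hypothetical toy scene: every vector here is made up for illustration.
scene = {
    "mug": {
        "feature": [0.9, 0.1, 0.0],
        "parts": {"handle": [0.8, 0.6, 0.0], "body": [1.0, 0.0, 0.0]},
    },
    "chair": {
        "feature": [0.0, 0.2, 0.9],
        "parts": {"leg": [0.0, 0.0, 1.0], "seat": [0.0, 0.7, 0.7]},
    },
}

# Stand-in for the text embedding of a query such as "mug handle".
query = [0.7, 0.7, 0.0]

print(retrieve(query, scene))
```

In a real system the features would come from a vision-language model and the part-level features would be queried from the neural field, so the object-level step could serve navigation while the part-level step could serve manipulation.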

OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding

Open-NeRF: Towards Open Vocabulary NeRF Decomposition

OpenOcc: Open Vocabulary 3D Scene Reconstruction via Occupancy Representation

OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding

OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

Neural Rendering in a Room: Amodal 3D Understanding and Free-Viewpoint Rendering for the Closed Scene Composed of Pre-Captured Objects

NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes

O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

Omni-Recon: Harnessing Image-based Rendering for General-Purpose Neural Radiance Fields

OpenScene: 3D Scene Understanding with Open Vocabularies

Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation

Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Obj-NeRF: Extract Object NeRFs from Multi-view Images

OVExp: Open Vocabulary Exploration for Object-Oriented Navigation

OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via Vision-Language Foundation Models

OpenSight: A Simple Open-Vocabulary Framework for LiDAR-Based Object Detection

OmniNeRF: Hybriding Omnidirectional Distance and Radiance fields for Neural Surface Reconstruction

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation