Abstract: The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties because the semantics provided by CLIP are noisy and view-inconsistent. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization, which leverages 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge posed by view-inconsistent semantics. Rather than invariably utilizing the inconsistent 2D semantics from CLIP, CSE leverages the consistent 3D semantics generated by the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments show that OV-NeRF outperforms current state-of-the-art methods, achieving significant improvements of 20.31% and 18.42% in mIoU on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistently superior results across various CLIP configurations, further verifying its robustness. Project page: https://github.com/pcl3dv/OV-NeRF
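
For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean intersection-over-union is commonly computed from per-pixel class labels. It is not taken from the OV-NeRF code: the function name `mean_iou`, the flat-list input format, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over the classes that appear in pred or target.

    Hypothetical helper for illustration; labels are flat sequences of
    integer class ids, one per pixel.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is about 0.5.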

LERF: Language Embedded Radiance Fields

LTM-NeRF: Embedding 3D Local Tone Mapping in HDR Neural Radiance Field

NeRF-Loc: Visual Localization with Conditional Neural Radiance Field

Leveraging Panoptic Prior for 3D Zero-Shot Semantic Understanding Within Language Embedded Radiance Fields

LaTeRF: Label and Text Driven Object Radiance Fields

CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields

OV-NeRF: Open-vocabulary Neural Radiance Fields with Vision and Language Foundation Models for 3D Semantic Understanding

Lifelong LERF: Local 3D Semantic Inventory Monitoring Using FogROS2

Open-NeRF: Towards Open Vocabulary NeRF Decomposition

CLA-NeRF: Category-Level Articulated Neural Radiance Field

OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views

Transient Neural Radiance Fields for Lidar View Synthesis and 3D Reconstruction

Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Language-enhanced RNR-Map: Querying Renderable Neural Radiance Field maps with natural language

LAENeRF: Local Appearance Editing for Neural Radiance Fields

LLaNA: Large Language and NeRF Assistant

LiDeNeRF: Neural radiance field reconstruction with depth prior provided by LiDAR point cloud

VRS-NeRF: Visual Relocalization with Sparse Neural Radiance Field

CLONeR: Camera-Lidar Fusion for Occupancy Grid-aided Neural Representations

NeRFactor

NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild