Abstract: Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and thus the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g., 34.5%$\sim$65.3%), instance segmentation (e.g., 21.8%$\sim$54.0%) and panoptic segmentation (e.g., 14.7%$\sim$43.3%). Code will be available.
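
The abstract does not spell out how points and captions are paired, so the following is only a minimal sketch of one plausible building block: associating the caption of a single posed image with the 3D points that project inside that image. It assumes a pinhole camera with known intrinsics and a world-to-camera pose, ignores occlusion, and uses hypothetical names throughout (`points_in_view`, `associate_captions`, and the `views` dictionary layout are illustrative, not taken from the paper or its code).

```python
import numpy as np


def points_in_view(points_world, K, world_to_cam, width, height):
    """Boolean mask of the 3D points that project inside one image.

    Assumption: pinhole model, no occlusion test (a real pipeline would
    likely also compare against a depth map).
    """
    # Homogeneous coordinates, then world frame -> camera frame.
    ones = np.ones((len(points_world), 1))
    homo = np.concatenate([points_world, ones], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]

    # Only points in front of the camera can be visible.
    in_front = cam[:, 2] > 1e-6

    # Perspective projection; substitute z = 1 for points behind the
    # camera so the division is safe (they are masked out anyway).
    z = np.where(in_front, cam[:, 2], 1.0)
    uv = (K @ cam.T).T
    u = uv[:, 0] / z
    v = uv[:, 1] / z

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return in_front & inside


def associate_captions(points_world, views):
    """Pair each view's caption with the indices of the points it sees."""
    pairs = []
    for view in views:
        mask = points_in_view(
            points_world,
            view["K"],
            view["world_to_cam"],
            view["width"],
            view["height"],
        )
        pairs.append((view["caption"], np.flatnonzero(mask)))
    return pairs


# Toy scene: four points and one camera at the world origin looking along +z.
points = np.array([
    [0.0, 0.0, 2.0],   # in front of the camera, projects to the image centre
    [0.0, 0.0, -1.0],  # behind the camera
    [5.0, 0.0, 1.0],   # in front, but projects outside the image
    [0.2, 0.2, 1.0],   # in front, projects inside the image
])
views = [{
    "K": np.array([[100.0, 0.0, 50.0],
                   [0.0, 100.0, 50.0],
                   [0.0, 0.0, 1.0]]),
    "world_to_cam": np.eye(4),
    "width": 100,
    "height": 100,
    "caption": "a chair next to a table",  # placeholder caption text
}]

pairs = associate_captions(points, views)
for caption, idx in pairs:
    print(caption, idx.tolist())
```

Each resulting (caption, point indices) pair could then serve as a weak supervision signal for the embeddings of the selected points. The hierarchical association mentioned in the abstract presumably builds coarser and finer point sets on top of such per-view sets, but those details are not given here.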

LoCUS: Learning Multiscale 3D-consistent Features from Posed Images

CenterLPS: Segment Instances by Centers for LiDAR Panoptic Segmentation

Leveraging Local Planar Motion Property for Robust Visual Matching and Localization

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Locus: LiDAR-based Place Recognition using Spatiotemporal Higher-Order Pooling

Large-Scale 3D Semantic Mapping Using Monocular Vision

Learning high-level features by fusing multi-view representation of MLS point clouds for 3D object recognition in road environments

Leveraging Panoptic Prior for 3D Zero-Shot Semantic Understanding Within Language Embedded Radiance Fields

Representing and Recognizing Objects with Massive Local Image Patches

DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map

3D Object Recognition By Corresponding and Quantizing Neural 3D Scene Representations

Local-to-Global Semantic Learning for Multi-View 3D Object Detection from Point Cloud

Multi-level 3D CNN for Learning Multi-scale Spatial Features

Visual Landmark Learning Via Attention-Based Deep Neural Networks

Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views

Localization, balance and affinity: a stronger multifaceted collaborative salient object detector in remote sensing images

Monocular Visual Place Recognition in LiDAR Maps via Cross-Modal State Space Model and Multi-View Matching

CurriculumLoc: Enhancing Cross-Domain Geolocalization Through Multistage Refinement

3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection

Multi-Scale Point-Wise Convolutional Neural Networks for 3D Object Segmentation From LiDAR Point Clouds in Large-Scale Environments

(LC)$^2$: LiDAR-Camera Loop Constraints For Cross-Modal Place Recognition