Abstract: Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods to learn semantic-aware embeddings that exploit the 3D geometry between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which involves training object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization capabilities of instance grouping and thus the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g., 34.5%$\sim$65.3%), instance segmentation (e.g., 21.8%$\sim$54.0%), and panoptic segmentation (e.g., 14.7%$\sim$43.3%). Code will be available.

Cross-View Semantic Segmentation for Sensing Surroundings

Can We PASS Beyond the Field of View? Panoramic Annular Semantic Segmentation for Real-World Surrounding Perception

Unifying Terrain Awareness Through Real-Time Semantic Segmentation

In Defense Of Multi-Source Omni-Supervised Efficient Convnet For Robust Semantic Segmentation In Heterogeneous Unseen Domains

PASS: Panoramic Annular Semantic Segmentation

Omnisupervised Omnidirectional Semantic Segmentation

Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

Robustifying Semantic Cognition of Traversability Across Wearable RGB-depth Cameras

Semantic perception of curbs beyond traversability for real-world navigation assistance systems

CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers

Active Scene Understanding via Online Semantic Reconstruction

Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

CV-MOS: A Cross-View Model for Motion Segmentation

Learning Spatial-Semantic Features for Robust Video Object Segmentation

Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction

Cross-view Transformers for real-time Map-view Semantic Segmentation

Panoramic Panoptic Segmentation: Insights Into Surrounding Parsing for Mobile Agents via Unsupervised Contrastive Learning

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Scene-Centric Joint Parsing of Cross-View Videos

ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation

An Approach for Construct Semantic Map with Scene Classification and Object Semantic Segmentation