Abstract: Deep neural network models have achieved remarkable progress in 3D scene understanding when trained in the closed-set setting with full labels. However, a major bottleneck of current 3D recognition approaches is that they cannot recognize unseen novel classes beyond the training categories in diverse real-world applications. Meanwhile, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which perform well only in a fully supervised manner. This work presents a generalized and simple framework for 3D scene understanding when labeled scenes are quite limited. To extract knowledge about novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale vision-language models and thereby benefits open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel boundary-aware energy-based loss that benefits from region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, which uses confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All code, models, and data are made publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing

3D Scene Parsing via Class-Wise Adaptation

Learning to Simulate Complex Scenes for Street Scene Segmentation

Learning 3D Scene Synthesis from Annotated RGB-D Images

3D-to-2D Distillation for Indoor Scene Parsing

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

A Curriculum Domain Adaptation Approach to the Semantic Segmentation of Urban Scenes

An Approach for Construct Semantic Map with Scene Classification and Object Semantic Segmentation

Single-Image 3D Scene Parsing Using Geometric Commonsense

Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing

3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud

SSR-2D: Semantic 3D Scene Reconstruction from 2D Images

Model2Scene: Learning 3D Scene Representation via Contrastive Language-CAD Models Pre-training

Learning 3D Semantic Scene Graphs From 3D Indoor Reconstructions

Geometry-semantic Aware for Monocular 3D Semantic Scene Completion

Learning to Synthesize 3D Indoor Scenes from Monocular Images

3DCNN-DQN-RNN: A Deep Reinforcement Learning Framework for Semantic Parsing of Large-Scale 3D Point Clouds

3D Face Parsing via Surface Parameterization and 2D Semantic Segmentation Network

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

MAP-ADAPT: Real-Time Quality-Adaptive Semantic 3D Maps

Large-Scale 3D Semantic Mapping Using Monocular Vision