Abstract: Deep neural network models have achieved remarkable progress in 3D scene understanding when trained in the closed-set setting with full labels. However, a major bottleneck of current 3D recognition approaches is that they cannot recognize novel classes beyond the training categories, which limits them in diverse real-world applications. Meanwhile, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks and perform well only in a fully supervised manner. This work presents a generalized and simple framework for 3D scene understanding when labeled scenes are quite limited. To extract knowledge of novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale vision-language models, which benefits open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel energy-based loss with boundary awareness that benefits from region-level boundary predictions. To encourage latent instance discrimination while preserving efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds, which uses confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All code, models, and data are publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing
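The abstract does not give the form of the region-level contrastive objective, so the following is only a minimal sketch of the general idea, assuming an InfoNCE-style loss in which regions with confident, matching pseudo-labels are treated as positives. The function name, the confidence threshold, and the temperature are all hypothetical and are not taken from the paper.

```python
import numpy as np

def region_contrastive_loss(embeddings, probs, conf_threshold=0.9, temperature=0.1):
    """Illustrative contrastive loss over region embeddings (not the paper's exact loss).

    embeddings: (R, D) array, one feature vector per region.
    probs:      (R, C) array, predicted class probabilities per region.
    Only regions whose top probability is >= conf_threshold take part.
    """
    confidence = probs.max(axis=1)
    pseudo_labels = probs.argmax(axis=1)
    keep = confidence >= conf_threshold
    z, y = embeddings[keep], pseudo_labels[keep]
    n = z.shape[0]
    if n < 2:
        return 0.0

    # Cosine similarity between all pairs of confident regions.
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = (z @ z.T) / temperature

    not_self = ~np.eye(n, dtype=bool)
    positives = (y[:, None] == y[None, :]) & not_self

    # Numerically stable log-sum-exp over every other region (self excluded).
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0

    # Average log-probability of positives per anchor, then over anchors.
    mean_log_pos = (log_prob * positives).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(-mean_log_pos.mean())
```

In this sketch the loss is small when regions sharing a confident pseudo-label have similar embeddings and large when they do not. Applying it "at multiple stages" would amount to calling it on the embeddings of several intermediate layers and summing the results, which is again an assumption about how the scheme could be wired up.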

Mining Regional Relation from Pixel-wise Annotation for Scene Parsing

Interaction via Bi-directional Graph of Semantic Region Affinity for Scene Parsing

Multi-Branch Adaptive Hard Region Mining Network for Urban Scene Parsing of High-Resolution Remote-Sensing Images

Global-residual and Local-boundary Refinement Networks for Rectifying Scene Parsing Predictions

Dual Relation Network for Scene Text Recognition

MS-RRFSegNet: Multiscale Regional Relation Feature Segmentation Network for Semantic Segmentation of Urban Scene Point Clouds

Regional Relation Modeling for Visual Place Recognition

Objectness Region Enhancement Networks for Scene Parsing

High Resolution Scene Parsing Network Based on Semantic Segmentation

Channel and Spatial Enhancement Network for human parsing

Adaptive Context Network for Scene Parsing

Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition

Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Robust Scene Parsing by Mining Supportive Knowledge From Dataset

Scene Parsing from an MAP Perspective

ORDNet: Capturing Omni-Range Dependencies for Scene Parsing

Exploring the Relationships of Regions for Visual Content Understanding

Depth Embedded Recurrent Predictive Parsing Network for Video Scenes

Multi-layer Feature Aggregation for Deep Scene Parsing Models

Visual Relationship Detection with Relative Location Mining

IDRNet: Intervention-Driven Relation Network for Semantic Segmentation