Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning

Kangcheng Liu, Yong-Jin Liu, Kai Tang, Ming Liu, Baoquan Chen
2023-12-01
Abstract: Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which merely perform well in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To leverage the boundary information, we propose a novel energy-based loss with boundary awareness benefiting from the region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Open-set Recognition**: Current 3D scene understanding methods are trained under a closed-set assumption and perform poorly on categories that do not appear in the training set, so they cannot recognize the diverse novel categories found in real-world applications. A framework is needed that adapts to different data distributions and recognizes such novel categories.
2. **Reliance on Large-scale Labeled Data**: Existing 3D scene understanding methods rely heavily on large amounts of high-quality labeled data, yet labeling large-scale 3D scenes is time-consuming and labor-intensive. As a result, deep network models struggle when labeled data is limited, which calls for data-efficient 3D scene understanding methods that can be trained with very few labels.

To address these problems, the paper proposes a general and simple framework, called WS3D++, which handles 3D scene understanding through hierarchical feature-aligned pre-training and region-aware fine-tuning, especially when labeled data is limited. The main contributions are:

1. **Pre-training stage**:
   - An effective knowledge distillation strategy is proposed to extract rich knowledge from large-scale vision-language models (such as CLIP) and transfer it to the 3D point cloud modality.
   - Rendering techniques are used to construct 2D views of large-scale 3D scenes and establish more accurate vision-language associations, achieving hierarchical alignment from the global scene level to the local object level.
   - A word-to-3D matching method is proposed to establish scene-level and object-level aligned language-3D feature representations, facilitating subsequent contrastive learning.
2. **Fine-tuning stage**:
   - A region-aware energy optimization method is proposed, using boundaries as additional information to assist 3D scene segmentation and understanding.
   - An unsupervised region-level semantic contrastive learning strategy is proposed for multi-stage feature discrimination, making full use of unlabeled data in combination with supervised losses.

By combining these two stages, the WS3D++ framework achieves state-of-the-art performance on multiple benchmark datasets, especially in 3D semantic segmentation, 3D instance segmentation, and 3D object detection tasks.
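To make two of the ideas summarized above more concrete, the following is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. It assumes that feature distillation can be approximated by a cosine-distance loss between each 3D feature and its paired teacher (vision-language) feature, and that region-level contrastive learning can be approximated by an InfoNCE-style loss over regions whose pseudo-label confidence exceeds a threshold. All function names, the `threshold` and `temperature` parameters, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def distillation_loss(point_feats, teacher_feats):
    """Mean (1 - cosine similarity) over paired 3D and teacher features.

    Assumption: a cosine-distance objective stands in for the paper's
    distillation loss, whose exact form is not given in this summary.
    """
    pairs = list(zip(point_feats, teacher_feats))
    return sum(1.0 - cosine(p, t) for p, t in pairs) / len(pairs)


def region_contrastive_loss(region_feats, pseudo_labels, confidences,
                            threshold=0.9, temperature=0.1):
    """InfoNCE-style loss over confidently pseudo-labeled regions.

    Regions with confidence below `threshold` are ignored. For each kept
    region, regions sharing its pseudo label are positives and all other
    kept regions form the contrast set. Returns 0.0 if no region has a
    positive partner.
    """
    keep = [i for i, c in enumerate(confidences) if c >= threshold]
    total, count = 0.0, 0
    for i in keep:
        others = [j for j in keep if j != i]
        positives = [j for j in others if pseudo_labels[j] == pseudo_labels[i]]
        if not positives:
            continue
        logits = {j: cosine(region_feats[i], region_feats[j]) / temperature
                  for j in others}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability assigned to the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


# Toy data: four 2D region embeddings, two pseudo classes, all confident.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
labels = [0, 0, 1, 1]
conf = [0.95, 0.95, 0.95, 0.95]

aligned = distillation_loss(feats, feats)                      # identical pairs
clustered = region_contrastive_loss(feats, labels, conf)       # labels match geometry
shuffled = region_contrastive_loss(feats, [0, 1, 0, 1], conf)  # labels cut across geometry
```

In this toy setup, `aligned` is close to zero because each feature is paired with itself, and `clustered` is smaller than `shuffled` because the first labeling agrees with how the embeddings are grouped. The confidence threshold is the part that keeps unreliable pseudo labels from shaping the embedding space.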