Abstract: To ease the difficulty of acquiring annotation labels for 3D data, a common approach is unsupervised and open-vocabulary semantic segmentation that leverages 2D CLIP semantic knowledge. In this paper, unlike previous research that ignores the "noise" introduced during feature projection from 2D to 3D, we propose a novel distillation learning framework named CUS3D. In our approach, an object-level denoising projection module is designed to screen out the "noise" and ensure more accurate 3D features. Based on the obtained features, a multimodal distillation learning module is designed to align the 3D features with the CLIP semantic feature space under object-centered constraints to achieve advanced unsupervised semantic segmentation. We conduct comprehensive experiments on both unsupervised and open-vocabulary segmentation, and the results consistently demonstrate the superiority of our model in unsupervised segmentation and its effectiveness in open-vocabulary segmentation.

What problem does this paper attempt to address?

This paper addresses the difficulty of obtaining labeled data for 3D point cloud semantic segmentation. Specifically, the authors propose CUS3D, a framework for unsupervised 3D semantic segmentation through object-level denoising.

#### Main problems:

1. **Difficulty in obtaining labels**: Annotating 3D data is very difficult, so the field of 3D point cloud semantic segmentation lacks sufficient labeled data, which limits research and development in this area.
2. **The "noise" problem in existing methods**: Existing unsupervised 3D semantic segmentation methods introduce a large amount of "noise" when projecting 2D features into 3D space, and this "noise" degrades the final segmentation accuracy.

#### Solutions:

To address these problems, the CUS3D framework proposes the following innovations:

1. **Object-level Denoising Projection (ODP) module**:
   - An object-level denoising projection module filters out the "noise" introduced in the 2D and 3D stages, yielding more accurate 3D features.
   - Efficient clustering and voting strategies ensure that the pixels or points within each object share the same features, reducing the influence of "noise".
2. **Multi-modal Distillation Learning (MDL) module**:
   - A multi-modal distillation learning module uses object-centered constraints to bring the 2D and 3D semantic spaces as close together as possible, further filtering out the influence of "noise".
   - Knowledge distillation aligns the 3D model more closely with the CLIP semantic space, thereby improving segmentation accuracy.

#### Experimental results:

- Experiments on the ScanNetV2 and S3DIS datasets show that CUS3D achieves state-of-the-art performance on unsupervised semantic segmentation.
- It also performs well on open-vocabulary semantic segmentation, indicating its robustness and effectiveness.

In conclusion, the CUS3D framework significantly improves the accuracy of unsupervised 3D semantic segmentation by aligning 3D features with the 2D CLIP semantic space and by applying object-level denoising.
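The two ideas summarized above, making features uniform within each object and pulling 3D features toward CLIP features, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that object IDs are already available from some clustering step, it uses mean pooling as a simple stand-in for the clustering-and-voting strategy (whose details the summary does not give), and it uses a plain cosine distance as the distillation objective. The function names `pool_by_object`, `cosine_similarity`, and `distillation_loss` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math


def pool_by_object(features, object_ids):
    """Replace every point feature with the mean feature of its object.

    Assumption: mean pooling stands in for the voting step; object_ids
    are taken as given (e.g. from an upstream clustering).
    """
    sums, counts = {}, {}
    for feat, oid in zip(features, object_ids):
        acc = sums.setdefault(oid, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[oid] = counts.get(oid, 0) + 1
    means = {oid: [v / counts[oid] for v in acc] for oid, acc in sums.items()}
    # Copy so that points of the same object do not share one list object.
    return [list(means[oid]) for oid in object_ids]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def distillation_loss(student, teacher):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    Assumption: 'student' holds 3D features and 'teacher' holds the
    denoised CLIP-derived targets for the same points.
    """
    pairs = list(zip(student, teacher))
    return sum(1.0 - cosine_similarity(s, t) for s, t in pairs) / len(pairs)
```

In this sketch, the pooled features would serve as the teacher targets, and the loss approaches zero as the 3D features point in the same direction as those targets.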

CUS3D: CLIP-based Unsupervised 3D Segmentation via Object-level Denoise

Pass3d: Precise And Accelerated Semantic Segmentation For 3d Point Cloud

Superpoint-guided Semi-supervised Semantic Segmentation of 3D Point Clouds

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

Transferring CLIP's Knowledge into Zero-Shot Point Cloud Semantic Segmentation

PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds Via Cross-modal Distillation and Super-Voxel Clustering

CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation

CUS3D: A New Comprehensive Urban-Scale Semantic-Segmentation 3D Benchmark Dataset

Weakly Supervised 3D Instance Segmentation without Instance-level Annotations

CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP

CLIP2UDA: Making Frozen CLIP Reward Unsupervised Domain Adaptation in 3D Semantic Segmentation

CSD3D: Cross-Scale Distillation Via Dual-Consistency Learning for Semi-Supervised 3D Object Detection

3D Guided Weakly Supervised Semantic Segmentation

Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation

SA3DIP: Segment Any 3D Instance with Potential 3D Priors

3DCFS: Fast and Robust Joint 3D Semantic-Instance Segmentation via Coupled Feature Selection