CUS3D: CLIP-based Unsupervised 3D Segmentation via Object-level Denoise

Fuyang Yu, Runze Tian, Zhen Wang, Xiaochuan Wang, Xiaohui Liang
2024-09-21
Abstract: To ease the difficulty of acquiring annotation labels for 3D data, a common approach is unsupervised and open-vocabulary semantic segmentation, which leverages 2D CLIP semantic knowledge. In this paper, unlike previous research that ignores the "noise" introduced when projecting features from 2D to 3D, we propose a novel distillation learning framework named CUS3D. In our approach, an object-level denoising projection module is designed to screen out the "noise" and ensure more accurate 3D features. Based on the obtained features, a multimodal distillation learning module is designed to align the 3D features with the CLIP semantic feature space under object-centered constraints to achieve advanced unsupervised semantic segmentation. We conduct comprehensive experiments in both unsupervised and open-vocabulary segmentation, and the results consistently showcase the superiority of our model in unsupervised segmentation and its effectiveness in open-vocabulary segmentation.
Computer Vision and Pattern Recognition, Multimedia
### What problems does this paper attempt to solve?

This paper addresses the difficulty of obtaining labeled data for 3D point cloud semantic segmentation. Specifically, the authors propose a new framework named CUS3D for unsupervised 3D semantic segmentation through object-level denoising.

#### Main problems:

1. **Difficulty in obtaining labels**: Annotating 3D data is very difficult, so 3D point cloud semantic segmentation lacks sufficient labeled data, which limits research and development in this field.
2. **The "noise" problem in existing methods**: Existing unsupervised 3D semantic segmentation methods introduce a large amount of "noise" when projecting 2D features into 3D space, and this "noise" degrades the final segmentation accuracy.

#### Solutions:

To address these problems, the CUS3D framework proposes the following innovations:

1. **Object-level Denoising Projection (ODP) module**:
   - An object-level denoising projection module is designed to filter out the "noise" introduced in the 2D and 3D stages, so as to obtain more accurate 3D features.
   - Efficient clustering and voting strategies ensure that the pixels or points within each object share the same features, reducing the influence of "noise".
2. **Multi-modal Distillation Learning (MDL) module**:
   - A multi-modal distillation learning module is introduced, using object-centered constraints to bring the 2D and 3D semantic spaces as close together as possible, further filtering out the influence of "noise".
   - Through knowledge distillation, the 3D model is better aligned with the CLIP semantic space, thereby improving segmentation accuracy.

#### Experimental results:

- Experiments on the ScanNetV2 and S3DIS datasets show that CUS3D achieves state-of-the-art performance in unsupervised semantic segmentation tasks.
- It also performs well in open-vocabulary semantic segmentation tasks, indicating its robustness and effectiveness.

In conclusion, the CUS3D framework improves the accuracy of unsupervised 3D semantic segmentation by aligning 3D features with the 2D CLIP semantic space and by applying object-level denoising.
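
The voting idea behind the ODP module can be illustrated with a small sketch. The summary does not say exactly what is voted on or how objects are obtained, so this example makes two assumptions: each point already carries a discrete pseudo-label, and a precomputed object id per point is available from some clustering step. The function name `object_level_vote` is hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def object_level_vote(point_labels, object_ids):
    """Replace each point's label with the majority label of its object.

    Assumption: point_labels[i] is a noisy pseudo-label for point i and
    object_ids[i] is the id of the object (cluster) that point i belongs to.
    """
    # Count how often each label occurs inside each object.
    votes = defaultdict(Counter)
    for label, obj in zip(point_labels, object_ids):
        votes[obj][label] += 1

    # Pick the most frequent label per object; ties go to the label seen first.
    winner = {obj: counts.most_common(1)[0][0] for obj, counts in votes.items()}

    # Every point inherits the winning label of its object.
    return [winner[obj] for obj in object_ids]


# Toy example: two objects, each with one mislabeled ("noisy") point.
noisy = ["chair", "chair", "table", "chair", "floor", "floor", "wall"]
objects = [0, 0, 0, 0, 1, 1, 1]
denoised = object_level_vote(noisy, objects)
print(denoised)
```

In this toy case the stray `table` and `wall` labels are overwritten by the majority label of their object, which is the sense in which points within one object end up sharing the same prediction.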
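
Aligning 3D features with the CLIP semantic space, as the MDL module is described as doing, can be sketched as follows. The summary does not state the loss used in the paper, so this example assumes a plain cosine-distance distillation loss between paired student (3D) and teacher (CLIP-derived) feature vectors, followed by nearest-text-embedding classification. All function names and the toy 2D vectors are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors (both assumed to be nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) over paired feature vectors.

    Assumption: student_feats[i] is the 3D network's feature for object i
    and teacher_feats[i] is the corresponding CLIP-derived target feature.
    """
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_feats, teacher_feats)
    ]
    return sum(losses) / len(losses)


def classify(feature, text_embeddings):
    """Return the class name whose text embedding is most similar to feature."""
    return max(
        text_embeddings,
        key=lambda name: cosine_similarity(feature, text_embeddings[name]),
    )


# Toy 2D "embeddings" standing in for CLIP text features (illustrative only).
text_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
student = [[0.9, 0.1], [0.2, 0.8]]
teacher = [[1.0, 0.0], [0.0, 1.0]]

print(distillation_loss(student, teacher))
print([classify(f, text_embeddings) for f in student])
```

Once the 3D features live in the same space as the text embeddings, segmentation reduces to a similarity lookup against class-name embeddings, which is also what makes an open-vocabulary setting possible in principle.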