Open-Ended 3D Point Cloud Instance Segmentation

Phuc D.A. Nguyen, Minh Luu, Anh Tran, Cuong Pham, Khoi Nguyen
2024-08-22
Abstract: Open-Vocab 3D Instance Segmentation methods (OV-3DIS) have recently demonstrated their ability to generalize to unseen objects. However, these methods still depend on predefined class names during testing, restricting the autonomy of agents. To mitigate this constraint, we propose a novel problem termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the necessity for predefined class names during testing. Moreover, we contribute a comprehensive set of strong baselines, derived from OV-3DIS approaches and leveraging 2D Multimodal Large Language Models. To assess the performance of our OE-3DIS system, we introduce a novel Open-Ended score, evaluating both the semantic and geometric quality of predicted masks and their associated class names, alongside the standard AP score. Our approach demonstrates significant performance improvements over the baselines on the ScanNet200 and ScanNet++ datasets. Remarkably, our method surpasses the performance of Open3DIS, the current state-of-the-art method in OV-3DIS, even in the absence of ground-truth object class names.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the problem of Open-Ended 3D Point Cloud Instance Segmentation (OE-3DIS). Existing open-vocabulary 3D instance segmentation methods (OV-3DIS) can generalize to unseen objects, but at test time they still rely on predefined category names, which limits the autonomy of agents. To remove this dependency, the authors propose the OE-3DIS task, which requires no predefined category names during testing: the system generates both the instance segmentation masks and their corresponding category names directly from 3D point clouds.

### Main Contributions

1. **Proposing the OE-3DIS Task**: The goal of this task is to segment all instances from 3D point clouds and generate their category names without providing predefined category names during testing.
2. **Introducing the OE Score**: This score evaluates not only the Intersection over Union (IoU) between predicted instance masks and ground truth masks but also the semantic similarity between predicted category names and ground truth category names.
3. **Exploring Various Methods**: The authors explore methods using OV-3DIS and Multimodal Large Language Models (MLLMs) to achieve the OE-3DIS task.
4. **Proposing a Training-Free OE-3DIS Method**: This method achieves performance comparable to existing OV-3DIS methods by lifting 2D visual tokens to 3D space and inputting them into a pre-trained language model, without requiring additional training data.

### Experimental Results

The authors conducted experiments on the ScanNet200 and ScanNet++ datasets, showing that:

- On the ScanNet200 dataset, the proposed Pointwise method outperforms the other baseline methods in both AP and OE scores, even surpassing the Open3DIS method, which relies on ground truth category names.
- On the ScanNet++ dataset, the proposed Pointwise method also performs well, significantly outperforming existing OV-3DIS methods and fully supervised 3D instance segmentation methods.
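
To make the idea behind the OE score more concrete, here is a minimal Python sketch of a score that rewards both mask overlap and name similarity for one predicted/ground-truth pair. It is an illustration only, not the paper's formula: the summary does not state how the two terms are combined, so the product used in `combined_score` is an assumption, as are all function names. The name embeddings are assumed to come from some text encoder and are passed in as plain lists of floats.

```python
import math


def mask_iou(pred_points, gt_points):
    """IoU between two instance masks given as collections of point indices."""
    pred, gt = set(pred_points), set(gt_points)
    union = pred | gt
    if not union:
        return 0.0
    return len(pred & gt) / len(union)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_score(pred_points, gt_points, pred_name_emb, gt_name_emb):
    """Hypothetical geometric-times-semantic score for one matched pair.

    ASSUMPTION: the two terms are multiplied and negative similarities are
    clamped to zero. The paper's actual OE score may be defined differently.
    """
    iou = mask_iou(pred_points, gt_points)
    sim = max(0.0, cosine_similarity(pred_name_emb, gt_name_emb))
    return iou * sim
```

Under this assumed combination, a prediction scores highly only when its mask overlaps the ground truth well and its generated name is semantically close to the ground truth name; a perfect mask with an unrelated name, or a correct name on a poorly overlapping mask, both score low.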
### Qualitative Comparison

Qualitative results show that the proposed method can accurately assign category names to 3D instances, even if these category names do not exactly match the ground truth category names. Particularly in complex scenes, the method can identify new categories not present in the dataset vocabulary, such as "copier."

### Ablation Study

The authors validate the effectiveness of different design choices through ablation studies, including different point feature aggregation techniques, text encoders, and text prompts. The results indicate that the weighted average aggregation technique and the Sentence Transformer text encoder perform best.

### Discussion and Conclusion

Although the proposed method makes significant progress in open-ended 3D scene understanding, some limitations remain, such as the reliance on 2D visual tokens and performance challenges when handling large-scale datasets. Future research can address these aspects to improve the robustness and applicability of the method.
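
As a rough illustration of the weighted average aggregation mentioned in the ablation study, the following Python sketch fuses the 2D features that several camera views contribute to a single 3D point. It is a simplified example under stated assumptions, not the authors' implementation: the function name is hypothetical, and the summary does not say what the weights represent, so treating them as per-view confidence values is an assumption.

```python
def aggregate_point_feature(view_features, view_weights):
    """Weighted average of per-view feature vectors for one 3D point.

    view_features: list of equal-length feature vectors, one per view in
                   which the point is visible.
    view_weights:  one non-negative weight per view (ASSUMPTION: something
                   like a visibility or mask confidence).
    Returns the fused feature vector, or None if no view contributes.
    """
    if len(view_features) != len(view_weights):
        raise ValueError("need exactly one weight per view feature")
    total = sum(view_weights)
    if not view_features or total <= 0:
        return None
    dim = len(view_features[0])
    fused = [0.0] * dim
    for feature, weight in zip(view_features, view_weights):
        for i in range(dim):
            fused[i] += weight * feature[i]
    return [value / total for value in fused]
```

For example, a point seen in two views with features `[1.0, 0.0]` and `[0.0, 1.0]` and weights `3` and `1` would be assigned the fused feature `[0.75, 0.25]`, so the more heavily weighted view dominates the result.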