Abstract: Open-Vocab 3D Instance Segmentation methods (OV-3DIS) have recently demonstrated their ability to generalize to unseen objects. However, these methods still depend on predefined class names during testing, restricting the autonomy of agents. To mitigate this constraint, we propose a novel problem termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the necessity for predefined class names during testing. Moreover, we contribute a comprehensive set of strong baselines, derived from OV-3DIS approaches and leveraging 2D Multimodal Large Language Models. To assess the performance of our OE-3DIS system, we introduce a novel Open-Ended score, evaluating both the semantic and geometric quality of predicted masks and their associated class names, alongside the standard AP score. Our approach demonstrates significant performance improvements over the baselines on the ScanNet200 and ScanNet++ datasets. Remarkably, our method surpasses the performance of Open3DIS, the current state-of-the-art method in OV-3DIS, even in the absence of ground-truth object class names.

What problem does this paper attempt to address?

The paper addresses the problem of Open-Ended 3D Point Cloud Instance Segmentation (OE-3DIS). Existing open-vocabulary 3D instance segmentation (OV-3DIS) methods can generalize to unseen objects, but they still rely on predefined category names at test time, which limits the autonomy of agents. To remove this dependency, the authors propose the OE-3DIS task, in which a system must produce instance segmentation masks and their corresponding category names directly from 3D point clouds, with no predefined category names supplied during testing.

### Main Contributions

1. **Proposing the OE-3DIS Task**: The goal of this task is to segment all instances in a 3D point cloud and generate their category names, without predefined category names being provided during testing.
2. **Introducing the OE Score**: This score evaluates not only the Intersection over Union (IoU) between predicted and ground-truth instance masks but also the semantic similarity between predicted and ground-truth category names.
3. **Exploring Various Methods**: The authors explore approaches to the OE-3DIS task built on OV-3DIS methods and Multimodal Large Language Models (MLLMs).
4. **Proposing a Training-Free OE-3DIS Method**: This method lifts 2D visual tokens to 3D space and feeds them into a pre-trained language model. It achieves performance comparable to existing OV-3DIS methods without requiring additional training data.

### Experimental Results

The authors conducted experiments on the ScanNet200 and ScanNet++ datasets, showing that:

- On ScanNet200, the proposed Pointwise method outperforms the other baselines in both AP and OE scores, even surpassing Open3DIS, which relies on ground-truth category names.
- On ScanNet++, the proposed Pointwise method also performs strongly, significantly outperforming existing OV-3DIS methods and fully supervised 3D instance segmentation methods.
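The OE score listed among the contributions pairs a geometric term (mask IoU) with a semantic term (similarity between class names). The summary does not give the exact formula, so the following is only a minimal sketch under stated assumptions: masks are represented as collections of point indices, class names have already been embedded as vectors by some text encoder, and the two terms are combined by a simple product. The function names and the product rule are hypothetical and are not taken from the paper.

```python
import math


def mask_iou(pred_points, gt_points):
    """IoU between two instance masks given as collections of point indices."""
    pred, gt = set(pred_points), set(gt_points)
    union = pred | gt
    if not union:
        return 0.0
    return len(pred & gt) / len(union)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def open_ended_match_score(pred_points, gt_points, pred_name_vec, gt_name_vec):
    """Hypothetical combination (an assumption, not the paper's definition):
    geometric IoU multiplied by the name similarity clamped to [0, 1]."""
    semantic = max(0.0, cosine_similarity(pred_name_vec, gt_name_vec))
    return mask_iou(pred_points, gt_points) * semantic
```

Under this assumed combination, a prediction scores highly only when its mask overlaps the ground truth and its name is semantically close to the ground-truth name, so a near-synonym such as "sofa" for "couch" would be penalized less than an unrelated label.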
### Qualitative Comparison

Qualitative results show that the proposed method assigns accurate category names to 3D instances, even when those names do not exactly match the ground-truth category names. In complex scenes in particular, the method can identify categories that are absent from the dataset vocabulary, such as "copier."

### Ablation Study

The authors validate their design choices through ablation studies covering point feature aggregation techniques, text encoders, and text prompts. The results indicate that weighted-average aggregation and the Sentence Transformer text encoder perform best.

### Discussion and Conclusion

Although the proposed method makes significant progress in open-ended 3D scene understanding, limitations remain, such as its reliance on 2D visual tokens and performance challenges on large-scale datasets. Future research can address these aspects to improve the method's robustness and applicability.
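The ablation study singles out weighted-average aggregation of point features. The summary does not say what the weights are, so the sketch below assumes they are non-negative per-view scores (for example, a visibility or confidence value for each 2D view in which a 3D point is seen). The function name `weighted_average` is hypothetical; this illustrates the general technique, not the authors' implementation.

```python
def weighted_average(features, weights):
    """Aggregate several equal-length feature vectors into one vector.

    `features` is a list of vectors (e.g. one per 2D view of a 3D point) and
    `weights` holds one non-negative weight per vector (assumed semantics).
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    dim = len(features[0])
    aggregated = [0.0] * dim
    for vector, weight in zip(features, weights):
        if len(vector) != dim:
            raise ValueError("all feature vectors must have the same length")
        for i, value in enumerate(vector):
            aggregated[i] += weight * value

    # Normalize by the total weight so the result is a convex combination.
    return [value / total for value in aggregated]
```

Compared with a plain mean, this lets views that observe the point more reliably contribute more to the aggregated feature; with equal weights it reduces to the ordinary average.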

Open-Ended 3D Point Cloud Instance Segmentation

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Segment Any 3D Object with Language

OccuSeg: Occupancy-Aware 3D Instance Segmentation

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

SA3DIP: Segment Any 3D Instance with Potential 3D Priors

ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

Associate Semantic-Instance Segmentation of 3D Point Clouds Based on Local Feature Extraction

Search3D: Hierarchical Open-Vocabulary 3D Segmentation

ImOV3D: Learning Open-Vocabulary Point Clouds 3D Object Detection from Only 2D Images

Open-Set 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

Open-Vocabulary SAM3D: Towards Training-free Open-Vocabulary 3D Scene Understanding

SAI3D: Segment Any Instance in 3D Scenes