Abstract: We study open-world part segmentation in 3D: segmenting any part in any object based on any text query. Prior methods are limited in object categories and part vocabularies. Recent advances in AI have demonstrated effective open-world recognition capabilities in 2D. Inspired by this progress, we propose an open-world, direct-prediction model for 3D part segmentation that can be applied zero-shot to any object. Our approach, called Find3D, trains a general-category point embedding model on large-scale 3D assets from the internet without any human annotation. It combines a data engine, powered by foundation models for annotating data, with a contrastive training method. We achieve strong performance and generalization across multiple datasets, with up to a 3x improvement in mIoU over the next best method. Our model is 6x to over 300x faster than existing baselines. To encourage research in general-category open-world 3D part segmentation, we also release a benchmark for general objects and parts. Project website: https://ziqi-ma.github.io/find3dsite/

What problem does this paper attempt to address?

The paper addresses open-world part segmentation in 3D: segmenting any part of any object given any text query. This contrasts with closed-world methods, which are usually limited to specific object categories or part vocabularies. The paper proposes a model named Find3D, which is trained on large-scale 3D assets from the internet without human annotation and can segment a wide range of objects and parts zero-shot.

### Main Contributions

1. **Zero-shot, part-level, open-world direct-prediction model**: Find3D works on general object categories and part queries. Compared with existing methods, it achieves up to a 3x improvement in mIoU and runs inference 6x to over 300x faster.
2. **Data engine**: A data engine automatically annotates parts of large-scale 3D assets from the internet, with no human annotation, by combining existing vision and language foundation models.
3. **Benchmark**: A benchmark for open-world part-level semantic segmentation of general object categories and parts has been released. It contains diverse objects and places no restrictions on object pose.

### Method Overview

Find3D consists of two main parts:

1. **Data engine**: 2D foundation models (such as SAM and Gemini) automatically annotate 3D assets. The annotated data is used to train a 3D point cloud model.
2. **Contrastive training method**: A contrastive learning objective handles part hierarchy and ambiguity. The model predicts a semantic feature for each point, and a point is matched to a text query by the cosine similarity between its feature and the query's text embedding.
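
The query step described in the method overview can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `cosine` and `segment_points` are hypothetical, and the three-dimensional feature vectors are hand-written placeholders standing in for the outputs of a trained point embedding model and a text encoder. It assumes each point is assigned to the query whose embedding has the highest cosine similarity with the point's feature.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment_points(point_features, query_embeddings):
    """Return, for each point, the index of its most similar text query."""
    labels = []
    for feat in point_features:
        sims = [cosine(feat, q) for q in query_embeddings]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels

# Placeholder data (assumption): real features would come from the trained
# point model and a text encoder, and would have many more dimensions.
queries = ["seat", "leg"]
query_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
point_features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [2.0, 0.5, 0.0]]

labels = segment_points(point_features, query_embeddings)
print([queries[i] for i in labels])
```

Because cosine similarity ignores vector magnitude, the third point is matched by direction alone. In practice this loop would be replaced by a single normalized matrix product over all points and queries.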
### Experimental Results

- **Performance**: Find3D performs strongly on multiple benchmark datasets. On the Objaverse benchmark in particular, it achieves a 3x improvement in mIoU over the best baseline, PointCLIPV2.
- **Generalization**: Find3D performs well on seen categories and also generalizes strongly to unseen categories.
- **Robustness**: Find3D is robust to changes such as rephrased query text and object rotation.
- **Efficiency**: Find3D runs inference 6 to 300 times faster than other open-vocabulary baselines.

### Conclusion

Find3D achieves open-world 3D part segmentation by combining an automatic annotation data engine with a contrastive training method. The model outperforms existing methods on multiple benchmark datasets and is both robust and efficient in practical use.
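
For reference, the mIoU metric cited in the results is commonly computed as the mean of per-part intersection-over-union scores. The sketch below uses that standard definition on a toy six-point object. The helper names `part_iou` and `mean_iou` are hypothetical, and the paper's exact evaluation protocol (for example, how scores are averaged across objects or how unlabeled points are handled) is not stated here, so this should be read as an assumption.

```python
def part_iou(pred, truth, part):
    """IoU for one part label over per-point predictions and ground truth."""
    inter = sum(1 for p, t in zip(pred, truth) if p == part and t == part)
    union = sum(1 for p, t in zip(pred, truth) if p == part or t == part)
    # A part absent from both prediction and ground truth has no defined IoU.
    return inter / union if union else None

def mean_iou(pred, truth, parts):
    """Mean IoU over the parts that have a defined IoU."""
    ious = [part_iou(pred, truth, part) for part in parts]
    ious = [x for x in ious if x is not None]
    return sum(ious) / len(ious) if ious else 0.0

# Toy example (assumption): six points, one "leg" point mislabeled as "seat".
truth = ["seat", "seat", "leg", "leg", "back", "back"]
pred = ["seat", "seat", "leg", "seat", "back", "back"]

miou = mean_iou(pred, truth, ["seat", "leg", "back"])
print(round(miou, 4))
```

Here the per-part scores are 2/3 for "seat", 1/2 for "leg", and 1 for "back", so a single mislabeled point lowers two of the three part scores.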

Find Any Part in 3D

Search3D: Hierarchical Open-Vocabulary 3D Segmentation

SAMPart3D: Segment Any Part in 3D Objects

Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding

SAI3D: Segment Any Instance in 3D Scenes

3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

3D Part Segmentation via Geometric Aggregation of 2D Visual Features

Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments

Part2Object: Hierarchical Unsupervised 3D Instance Segmentation

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

OV-PARTS: Towards Open-Vocabulary Part Segmentation

Attentional Keypoint Detection on Point Clouds for 3D Object Part Segmentation

Part123: Part-aware 3D Reconstruction from a Single-view Image

Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

OpenSU3D: Open World 3D Scene Understanding using Foundation Models

SA3DIP: Segment Any 3D Instance with Potential 3D Priors

Anything-3D: Towards Single-view Anything Reconstruction in the Wild

Open-Vocabulary High-Resolution 3D (OVHR3D) Data Segmentation and Annotation Framework

Open-set Hierarchical Semantic Segmentation for 3D Scene