Find Any Part in 3D

Ziqi Ma, Yisong Yue, Georgia Gkioxari
2024-11-21
Abstract: We study open-world part segmentation in 3D: segmenting any part in any object based on any text query. Prior methods are limited in object categories and part vocabularies. Recent advances in AI have demonstrated effective open-world recognition capabilities in 2D. Inspired by this progress, we propose an open-world, direct-prediction model for 3D part segmentation that can be applied zero-shot to any object. Our approach, called Find3D, trains a general-category point embedding model on large-scale 3D assets from the internet without any human annotation. It combines a data engine, powered by foundation models for annotating data, with a contrastive training method. We achieve strong performance and generalization across multiple datasets, with up to a 3x improvement in mIoU over the next best method. Our model is 6x to over 300x faster than existing baselines. To encourage research in general-category open-world 3D part segmentation, we also release a benchmark for general objects and parts. Project website: https://ziqi-ma.github.io/find3dsite/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses open-world part segmentation in 3D: segmenting any part of any object according to any text query. Traditional closed-world methods, by contrast, are usually limited to specific object categories or fixed part vocabularies. The paper proposes a model named Find3D, which is trained on large-scale 3D assets from the internet without manual annotation and can then segment a wide range of objects and parts zero-shot.

### Main Contributions

1. **Zero-shot, part-level, open-world direct-prediction model**: Find3D works on general object categories and part queries. Compared with existing methods, it achieves up to a 3x improvement in mIoU over the next best method and runs 6x to over 300x faster at inference.
2. **Data engine**: A data engine automatically annotates parts of large-scale 3D assets from the internet, with no manual annotation. It does so by combining existing vision and language foundation models.
3. **Benchmark**: The authors release a benchmark for evaluating open-world, part-level semantic segmentation on general object categories and parts. It contains diverse objects and places no restriction on object pose.

### Method Overview

Find3D consists of two main parts:

1. **Data engine**: 2D foundation models (such as SAM and Gemini) automatically annotate 3D assets. The annotated data are then used to train a 3D point cloud model.
2. **Contrastive training method**: A contrastive learning objective handles part hierarchies and ambiguity. The model predicts a semantic feature for each point, and a text query is matched to points by the cosine similarity between each point feature and the embedding of the query text.
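The cosine-similarity matching described in the method overview can be illustrated with a small sketch. This is not the authors' implementation: the function names `cosine_similarity` and `assign_parts`, the three-dimensional toy vectors, and the rule of giving each point the single best-matching query are all assumptions made for illustration. In the real system the point features would come from the trained point cloud model and the query embeddings from a text encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_parts(point_features, query_embeddings):
    """For each point, return the name of the query whose embedding is most similar.

    point_features: list of per-point feature vectors (hypothetical model output).
    query_embeddings: dict mapping a query string to its embedding vector
        (hypothetical text-encoder output).
    """
    labels = []
    for feature in point_features:
        best_query = max(
            query_embeddings,
            key=lambda name: cosine_similarity(feature, query_embeddings[name]),
        )
        labels.append(best_query)
    return labels


# Toy stand-ins for learned features; real embeddings are much higher-dimensional.
point_features = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]
query_embeddings = {
    "seat": [1.0, 0.0, 0.0],
    "leg": [0.0, 1.0, 0.0],
    "back": [0.0, 0.0, 1.0],
}

labels = assign_parts(point_features, query_embeddings)
print(labels)
```

Because the per-point features do not depend on the query, the same features can be reused for any new set of text queries; only the similarity step is repeated.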
### Experimental Results

- **Performance**: Find3D performs strongly on multiple benchmark datasets. On the Objaverse benchmark in particular, its mIoU is 3 times that of the best baseline method, PointCLIPV2.
- **Generalization**: Find3D performs well on seen categories and also generalizes to unseen categories.
- **Robustness**: Find3D is robust to changes such as rephrased query text and object rotation.
- **Efficiency**: Find3D runs inference 6 to 300 times faster than other open-vocabulary baseline methods.

### Conclusion

Find3D achieves open-world 3D part segmentation by combining an automatic annotation data engine with a contrastive training method. The model outperforms existing methods on multiple benchmark datasets while remaining robust and efficient at inference.
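For readers unfamiliar with the mIoU metric quoted in the results, the following is a minimal sketch of how a per-object mean intersection-over-union can be computed from per-point labels. The helper names `part_iou` and `mean_iou`, the toy labels, and the choice to skip parts that appear in neither prediction nor ground truth are assumptions; this summary does not specify the paper's exact averaging protocol.

```python
def part_iou(pred, truth, part):
    """IoU of one part over per-point labels; None if the part appears in neither list."""
    intersection = sum(1 for p, t in zip(pred, truth) if p == part and t == part)
    union = sum(1 for p, t in zip(pred, truth) if p == part or t == part)
    return intersection / union if union else None


def mean_iou(pred, truth, parts):
    """Average IoU over the parts that occur in the prediction or the ground truth."""
    ious = [part_iou(pred, truth, part) for part in parts]
    ious = [value for value in ious if value is not None]
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-point labels for a six-point object (hypothetical data).
truth = ["seat", "seat", "leg", "leg", "back", "back"]
pred = ["seat", "leg", "leg", "leg", "back", "seat"]

score = mean_iou(pred, truth, ["seat", "leg", "back"])
print(round(score, 3))
```

In this toy case the per-part IoUs are 1/3 for "seat", 2/3 for "leg", and 1/2 for "back", so the mean is 0.5.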