SAI3D: Segment Any Instance in 3D Scenes

Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, Baoquan Chen
2024-03-24
Abstract: Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets, limiting their application to a narrow spectrum of object categories. Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning, yet these methods struggle to distinguish between objects of the same categories and rely on specific prompts that are not universally applicable. In this paper, we introduce SAI3D, a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover, we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism, which largely improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably, SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++. Our project page is at https://yd-yin.github.io/SAI3D.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses instance segmentation in 3D scenes when no 3D annotations are available. Existing methods depend on annotated datasets, which restricts them to a narrow set of object categories, while CLIP-based open-set methods struggle to separate objects of the same category. SAI3D combines geometric priors with the masks of a 2D segmentation model (SAM) to achieve zero-shot 3D instance segmentation. It partitions the 3D scene into geometric primitives and progressively merges them into instances that are consistent with the multi-view SAM masks, using hierarchical region growing with dynamic thresholding to make fine-grained scene parsing more robust. Experiments on ScanNet, Matterport3D and ScanNet++ show that SAI3D outperforms open-vocabulary baselines and, for class-agnostic segmentation on ScanNet++, even surpasses fully supervised methods.
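The merging step can be illustrated with a small sketch. This is not the authors' implementation: the affinity definition (the fraction of shared views in which two primitives fall inside the same 2D mask), the averaging of affinities between regions, the decreasing threshold schedule, and all function names are assumptions chosen only to show how primitives might be grouped progressively under multi-view mask agreement.

```python
def primitive_affinity(labels_a, labels_b):
    """Assumed affinity: share of views, among those where both primitives
    are visible, in which they carry the same 2D mask id (None = not visible)."""
    shared = [(a, b) for a, b in zip(labels_a, labels_b)
              if a is not None and b is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)


def region_affinity(region_a, region_b, view_labels):
    """Assumed region score: mean primitive affinity over all cross pairs."""
    scores = [primitive_affinity(view_labels[p], view_labels[q])
              for p in region_a for q in region_b]
    return sum(scores) / len(scores)


def grow_regions(view_labels, adjacency, thresholds=(0.9, 0.7, 0.5)):
    """Merge adjacent regions greedily, one threshold stage at a time.

    view_labels: dict primitive id -> list of per-view mask ids (or None)
    adjacency:   list of (p, q) pairs of spatially adjacent primitives
    thresholds:  hypothetical schedule, strict first and looser later
    Returns a dict primitive id -> region id.
    """
    region_of = {p: p for p in view_labels}
    members = {p: {p} for p in view_labels}
    for threshold in thresholds:
        while True:
            best = None
            for p, q in adjacency:
                ra, rb = region_of[p], region_of[q]
                if ra == rb:
                    continue
                score = region_affinity(members[ra], members[rb], view_labels)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, ra, rb)
            if best is None:
                break  # nothing left to merge at this threshold
            _, ra, rb = best
            for p in members[rb]:
                region_of[p] = ra
            members[ra] |= members.pop(rb)
    return region_of


# Toy scene: primitives 0 and 1 share a mask in every common view,
# primitives 2 and 3 agree in two of three views, and 1 and 2 never agree.
view_labels = {0: [1, 1, None], 1: [1, 1, 1], 2: [2, 2, 2], 3: [2, 3, 2]}
adjacency = [(0, 1), (1, 2), (2, 3)]
regions = grow_regions(view_labels, adjacency)
```

In this toy scene the strict first stage joins primitives 0 and 1, the looser last stage joins 2 and 3 despite one inconsistent view, and the two resulting regions stay separate because no view places them in the same mask.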