Vocabulary-Free 3D Instance Segmentation with Vision and Language Assistant

Guofeng Mei, Luigi Riz, Yiming Wang, Fabio Poiesi
2024-08-20
Abstract: Most recent 3D instance segmentation methods are open vocabulary, offering greater flexibility than closed-vocabulary methods. Yet, they are limited to reasoning within a specific set of concepts, i.e., the vocabulary, prompted by the user at test time. In essence, these models cannot reason in an open-ended fashion, i.e., answering "List the objects in the scene." We introduce the first method to address 3D instance segmentation in a setting that is void of any vocabulary prior, namely a vocabulary-free setting. We leverage a large vision-language assistant and an open-vocabulary 2D instance segmenter to discover and ground semantic categories on the posed images. To form 3D instance masks, we first partition the input point cloud into dense superpoints, which are then merged into 3D instance masks. We propose a novel superpoint merging strategy via spectral clustering, accounting for both mask coherence and semantic coherence that are estimated from the 2D object instance masks. We evaluate our method using ScanNet200 and Replica, outperforming existing methods in both vocabulary-free and open-vocabulary settings. Code will be made available.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper addresses is how to remove the dependence on a predefined vocabulary prior in 3D instance segmentation, so that object instances can be segmented and labeled in a fully vocabulary-free setting, without any preset category information. Specifically:

- **Limitations of existing methods**: Most recent 3D instance segmentation methods are open-vocabulary: at test time they can recognize objects beyond the categories seen during training, but they remain limited to the specific vocabulary the user provides. They cannot answer open-ended questions such as "What objects are there in the scene?".
- **Proposed new problem setting**: The paper proposes a new task setting, Vocabulary-Free 3D Instance Segmentation (VoF3DIS). In this setting, the model must segment all object instances in the point cloud and assign semantic labels without any predefined vocabulary. The model therefore has to discover and name the objects in the scene on its own during inference, without relying on a user-specified vocabulary or query.
- **Challenges and significance**: The vocabulary-free setting matters most in dynamically changing scenes, such as assistive robotics applications, where the objects in the scene may be unknown or undefined in advance. Methods that can handle such scenes are therefore of practical importance.

### Specific objectives

1. **Introduce a new task**: Formally define the vocabulary-free 3D instance segmentation task (VoF3DIS), that is, segmenting object instances in 3D point clouds and assigning semantic labels without a predefined vocabulary.
2. **Propose a new method**: Design a new method named PoVo. This method uses a vision-language assistant and a 2D instance segmentation model to identify and localize objects in the scene, then lifts this information to 3D space to generate 3D instance masks.
3. **Innovations**:
   - PoVo is zero-shot: it does not need to be trained on 2D or 3D data.
   - It introduces a new superpoint merging strategy that uses spectral clustering and accounts for both mask coherence and semantic coherence to form the final 3D instance masks.
   - It outperforms existing methods on ScanNet200 and Replica, in both the vocabulary-free and the open-vocabulary settings.

### Solution overview

To achieve the above objectives, PoVo adopts the following steps:

1. **Scene vocabulary generation**: Extract an object list from the multi-view images with a vision-language assistant, and verify it with an open-vocabulary 2D instance segmentation model to form a scene vocabulary.
2. **Superpoint generation and merging**: Partition the 3D scene into geometrically consistent superpoints and merge them into 3D instance masks through spectral clustering, taking into account both the coherence of the 2D instance masks and semantic coherence.
3. **Text-aligned point representation**: Generate a text-aligned feature representation for the points in each 3D instance mask, which is used for the final semantic classification.

With this pipeline, PoVo can identify and segment object instances in 3D scenes without a predefined vocabulary.
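
The superpoint merging step can be illustrated with a minimal sketch. It is not the authors' implementation: this summary does not give the exact affinity definition, so the cosine-similarity affinities, the mixing weight `alpha`, the input layouts `mask_votes` and `sem_feats`, the fixed `n_clusters`, and the small farthest-point k-means are all assumptions made here for illustration. The sketch only shows the general idea of combining a mask-coherence term and a semantic-coherence term into one affinity matrix and clustering superpoints in the spectral embedding of its normalized Laplacian.

```python
import numpy as np


def combined_affinity(mask_votes, sem_feats, alpha=0.5):
    """Blend mask coherence and semantic coherence between superpoints.

    mask_votes: (S, M) array; entry [s, m] is how strongly superpoint s is
        covered by 2D instance mask m (hypothetical layout).
    sem_feats:  (S, D) array; one text-aligned feature per superpoint
        (hypothetical layout).
    alpha:      assumed mixing weight between the two terms.
    """
    eps = 1e-8
    m = mask_votes / (np.linalg.norm(mask_votes, axis=1, keepdims=True) + eps)
    f = sem_feats / (np.linalg.norm(sem_feats, axis=1, keepdims=True) + eps)
    a_mask = np.clip(m @ m.T, 0.0, 1.0)  # superpoints sharing the same 2D masks
    a_sem = np.clip(f @ f.T, 0.0, 1.0)   # superpoints with similar semantics
    w = alpha * a_mask + (1.0 - alpha) * a_sem
    np.fill_diagonal(w, 0.0)             # no self-loops in the graph
    return w


def spectral_merge(w, n_clusters, n_iters=20):
    """Assign each superpoint a cluster id via normalized spectral clustering."""
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-8))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    lap = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)        # eigenvalues in ascending order
    emb = vecs[:, :n_clusters]           # eigenvectors of the smallest eigenvalues
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)

    # Deterministic k-means with farthest-point initialization (an assumption;
    # any k-means implementation could be used instead).
    centers = [emb[0]]
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(n_iters):
        d2 = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(d2, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels
```

Calling `spectral_merge(combined_affinity(mask_votes, sem_feats), n_clusters)` returns one cluster id per superpoint; superpoints with the same id would be merged into one 3D instance mask. In practice the number of instances is not known in advance, so a real system would need some way to choose `n_clusters` (for example from the eigenvalue gap), which this sketch leaves out.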