Ayca Takmaz, Alexandros Delitzas, Robert W. Sumner, Francis Engelmann, Johanna Wald, Federico Tombari
Abstract: Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions. Existing methods for open-vocabulary 3D instance segmentation primarily focus on identifying object-level instances in a scene. However, they face challenges when it comes to understanding more fine-grained scene entities such as object parts, or regions described by generic attributes. In this work, we introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation, enabling the search for entities at varying levels of granularity: fine-grained object parts, entire objects, or regions described by attributes like materials. Our method aims to expand the capabilities of open vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting less anchored to explicit object-centric queries, compared to prior work. To ensure a systematic evaluation, we also contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan, along with a set of open-vocabulary fine-grained part annotations on ScanNet++. We verify the effectiveness of Search3D across several tasks, demonstrating that our approach outperforms baselines in scene-scale open-vocabulary 3D part segmentation, while maintaining strong performance in segmenting 3D objects and materials.
What problem does this paper attempt to address?
This paper aims to enable fine-grained open-vocabulary segmentation in 3D scenes. Existing open-vocabulary 3D instance segmentation methods mainly focus on identifying object-level instances in a scene, and they struggle with more fine-grained scene entities such as object parts or regions described by generic attributes. Specifically, the paper points out:
1. **Limitations of existing methods**:
- **Object-level segmentation**: Most existing 3D segmentation methods target object-level instance segmentation and perform poorly on more fine-grained entities, such as parts of objects.
- **Point-level representation**: Some methods obtain more fine-grained features by constructing per-point feature representations, but these are costly to store, produce noisy features, lack instance information, and require additional post-processing steps to extract 3D masks.
- **Semantic bias**: Existing 2D feature backbones tend to be biased towards object-level understanding when their features are projected onto 3D geometric representations, which makes it difficult to robustly identify object parts and other fine-grained elements.
2. **Research objectives**:
- **Multi-granularity segmentation**: The paper proposes a new method, Search3D, which builds a hierarchical open-vocabulary 3D scene representation to support search and segmentation of entities at different levels of granularity: fine-grained object parts, entire objects, and regions described by attributes.
- **Flexibility**: Compared with existing methods, Search3D is less anchored to explicit object-centric queries and instead offers a more flexible open-vocabulary 3D search setting driven by free-form, user-defined text queries.
3. **Specific problems**:
- **Fine-grained entity identification**: How to identify and segment fine-grained entities in 3D scenes, such as the arms of a chair or the legs of a table.
- **Multi-attribute query**: How to handle queries that span multiple regions, such as "wooden" regions, which may cover several different object parts.
- **Flexible query ability**: How to construct an intermediate hierarchical feature representation without knowing the query in advance, so that entities in the scene can be queried arbitrarily at inference time.
By solving these problems, the paper hopes to expand the capabilities of open-vocabulary 3D segmentation so that it can identify not only long-tail objects but also object parts and attribute-described regions that span multiple objects, thereby providing stronger support for interactive robot applications.
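The idea of querying a precomputed hierarchy at inference time can be illustrated with a small sketch. It is not the paper's implementation: the `Node` class, the `search` function, the toy 3-dimensional feature vectors, and the thresholds are all hypothetical, and the sketch simply assumes that each object and part already carries an embedding in a space shared with text queries. It shows how a single cosine-similarity search over every level of the hierarchy can return a whole object for one query and several parts of different objects for an attribute query.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Node:
    """One entity in a toy scene hierarchy (hypothetical structure)."""
    name: str            # label for readability only; never used for matching
    feature: np.ndarray  # embedding assumed to be precomputed before any query
    children: list = field(default_factory=list)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def flatten(nodes, level=0):
    """Yield (level, node) pairs: level 0 is an object, level 1 a part."""
    for node in nodes:
        yield level, node
        yield from flatten(node.children, level + 1)


def search(scene, query_feature, threshold):
    """Return (score, level, name) for every node scoring at or above threshold."""
    hits = [(cosine(node.feature, query_feature), level, node.name)
            for level, node in flatten(scene)]
    return sorted((h for h in hits if h[0] >= threshold), reverse=True)


# Toy features (assumption): axis 0 ~ "chair", axis 1 ~ "wooden", axis 2 ~ "table".
scene = [
    Node("chair", np.array([1.0, 0.0, 0.0]),
         [Node("chair arm", np.array([0.6, 0.8, 0.0]))]),
    Node("table", np.array([0.0, 0.0, 1.0]),
         [Node("table leg", np.array([0.0, 0.8, 0.6]))]),
]

# Stand-ins for text embeddings; a real system would use a text encoder.
wooden_query = np.array([0.0, 1.0, 0.0])
chair_query = np.array([1.0, 0.0, 0.0])

wooden_hits = search(scene, wooden_query, threshold=0.7)
chair_hits = search(scene, chair_query, threshold=0.9)
print([name for _, _, name in wooden_hits])
print([name for _, _, name in chair_hits])
```

In this toy scene the "wooden" query matches parts belonging to two different objects, while the "chair" query matches only the object-level node. Because the hierarchy is built before any query is known, the same structure serves both cases.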