Phuc D.A. Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, Khoi Nguyen
Abstract: We introduce Open3DIS, a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task. Recent advancements in Open-Vocabulary scene understanding have made significant strides in this area by employing class-agnostic 3D instance proposal networks for object localization and learning queryable features for each 3D mask. While these methods produce high-quality instance proposals, they struggle with identifying small-scale and geometrically ambiguous objects. The key idea of our method is a new module that aggregates 2D instance masks across frames and maps them to geometrically coherent point cloud regions as high-quality object proposals addressing the above limitations. These are then combined with 3D class-agnostic instance proposals to include a wide range of objects in the real world. To validate our approach, we conducted experiments on three prominent datasets, including ScanNet200, S3DIS, and Replica, demonstrating significant performance gains in segmenting objects with diverse categories over the state-of-the-art approaches.
What problem does this paper attempt to address?
This paper addresses Open-Vocabulary 3D Instance Segmentation (OV-3DIS). Specifically, the authors aim to predict binary instance masks for objects of arbitrary categories in a 3D scene, including categories that were not seen during training. The main challenges are:
1. **Limitations of traditional methods**: Existing fully supervised 3D instance segmentation methods work within a closed-set framework and can only recognize a predefined set of object categories.
2. **Difficulties in recognizing small-scale and geometrically ambiguous objects**: Existing open-vocabulary methods struggle with objects that are small or whose geometry is ambiguous, especially objects from rare categories.
3. **Challenges in 2D-to-3D mapping**: When instance masks from 2D images are mapped onto a 3D point cloud, background points or other irrelevant regions may be included, resulting in poor-quality 3D proposals.
To solve these problems, the paper proposes Open3DIS, a new open-vocabulary 3D instance segmentation method. Its main contributions are:
1. **2D-Guided 3D Proposal Module**: 2D instance masks from multi-view RGB-D images are aggregated and mapped to geometrically coherent point cloud regions, producing high-quality 3D object proposals.
2. **Pointwise Feature Extraction**: A new pointwise feature extraction method is introduced for open-vocabulary 3D object proposals.
3. **Excellent performance**: Experiments on three datasets (ScanNet200, S3DIS, and Replica) show that Open3DIS significantly outperforms existing methods on the OV-3DIS task, especially on rare categories.
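To make the 2D-to-3D mapping challenge concrete, the following is a minimal sketch of lifting one 2D instance mask onto a point cloud with a generic pinhole camera model and a depth check. It is not the paper's 2D-Guided 3D Proposal Module, which additionally aggregates masks across frames and snaps them to geometrically coherent regions. The function name `lift_mask_to_points`, its parameters, and the `depth_tol` value are all hypothetical.

```python
import numpy as np

def lift_mask_to_points(points, mask, depth, K, world_to_cam, depth_tol=0.05):
    """Mark the 3D points that project inside a 2D mask and pass a depth check.

    points:       (N, 3) world-space coordinates
    mask:         (H, W) boolean 2D instance mask
    depth:        (H, W) sensor depth for the same view
    K:            (3, 3) pinhole intrinsics (assumed, no lens distortion)
    world_to_cam: (4, 4) extrinsic matrix
    depth_tol:    hypothetical tolerance for the visibility test
    """
    mask = np.asarray(mask, dtype=bool)
    # Transform world-space points into the camera frame.
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (world_to_cam @ pts_h.T).T[:, :3]
    z = cam[:, 2]
    in_front = z > 1e-6
    # Pinhole projection to integer pixel coordinates.
    safe_z = np.where(in_front, z, 1.0)
    u = np.round(K[0, 0] * cam[:, 0] / safe_z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * cam[:, 1] / safe_z + K[1, 2]).astype(int)
    h, w = mask.shape
    in_image = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    selected = np.zeros(len(points), dtype=bool)
    idx = np.nonzero(in_image)[0]
    # A point counts as visible only if its depth agrees with the sensor depth.
    visible = np.abs(depth[v[idx], u[idx]] - z[idx]) < depth_tol
    selected[idx] = visible & mask[v[idx], u[idx]]
    return selected
```

Without the depth check, every point along a pixel's viewing ray would be selected, which is one way background points end up inside a lifted proposal.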
### Formula presentation
The key formulas used in the paper are:
- **Calculation of point cloud feature similarity**:
\[
s'_{i,j}=\cos(f^{3D}_i,f^{3D}_j)
\]
where \(f^{3D}_i\) and \(f^{3D}_j\) are the feature vectors of two point cloud regions, each obtained by averaging the features of the points in that region.
- **Calculation of the matching score matrix**:
\[
c_{i,j}=1(o'_{i,j}>\tau_{iou})\odot1(s'_{i,j}>\tau_{sim})
\]
where \(o'_{i,j}=\text{IoU}(r_i,r_j)\) is the intersection over union of the two regions \(r_i\) and \(r_j\), \(\tau_{iou}\) and \(\tau_{sim}\) are thresholds, \(1(\cdot)\) is the indicator function, and \(\odot\) acts as a logical AND, so two regions match only when both conditions hold.
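The similarity and matching formulas can be sketched together in NumPy as follows. This is an illustrative reading of the two equations, not the authors' implementation: regions are assumed to be non-empty boolean masks over the same point cloud, and the function names and the default threshold values are hypothetical.

```python
import numpy as np

def region_features(point_feats, regions):
    # Mean-pool the per-point features inside each boolean region mask.
    return np.stack([point_feats[r].mean(axis=0) for r in regions])

def cosine_matrix(feats):
    # Pairwise cosine similarity s'_{i,j} between region features.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)
    return unit @ unit.T

def iou_matrix(regions):
    # Pairwise IoU o'_{i,j} between boolean region masks of shape (R, N).
    r = np.asarray(regions, dtype=float)
    inter = r @ r.T
    area = r.sum(axis=1)
    union = area[:, None] + area[None, :] - inter
    return inter / np.maximum(union, 1.0)

def matching_matrix(point_feats, regions, tau_iou=0.5, tau_sim=0.9):
    # c_{i,j} is True only when both the IoU and the similarity pass
    # their thresholds (the threshold defaults here are placeholders).
    regions = np.asarray(regions, dtype=bool)
    s = cosine_matrix(region_features(point_feats, regions))
    o = iou_matrix(regions)
    return (o > tau_iou) & (s > tau_sim)
```

Requiring both tests means that two regions with heavy overlap but dissimilar features, or similar features but little overlap, are not matched.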
- **CLIP feature calculation**:
\[
F^{\text{CLIP}} = NV\left(\sum_k\sum_\lambda(\nu_\lambda\ast f^{\text{CLIP}}_{\lambda,k})\ast m^{3D}_k\right)
\]
where \(f^{\text{CLIP}}_{\lambda,k}\) is the 2D CLIP image feature of the \(k\)-th instance in the \(\lambda\)-th view, \(\nu_\lambda\) is the visibility map of view \(\lambda\), \(m^{3D}_k\) is the binary mask of the \(k\)-th proposal, and \(NV(x)\) is the L2-normalized vector of \(x\).
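One way to read this formula is that each point accumulates the CLIP features of every proposal that contains it, over every view in which the point is visible, and the sum is then L2-normalized per point. The sketch below follows that reading. The array shapes, the interpretation of \(\ast\) as broadcast multiplication, and the function name are assumptions, and the CLIP features are taken as precomputed inputs.

```python
import numpy as np

def pointwise_clip_features(view_feats, visibility, masks):
    """Accumulate per-view, per-proposal features onto points, then normalize.

    view_feats: (L, K, D) feature of proposal k taken from view lambda
    visibility: (L, N)    1 where point n is visible in view lambda
    masks:      (K, N)    binary 3D masks of the K proposals
    returns:    (N, D)    L2-normalized per-point features
    """
    vis = np.asarray(visibility, dtype=float)
    m = np.asarray(masks, dtype=float)
    # acc[n] = sum over k and lambda of vis[lambda, n] * m[k, n] * f[lambda, k]
    acc = np.einsum("ln,kn,lkd->nd", vis, m, view_feats)
    norms = np.linalg.norm(acc, axis=1, keepdims=True)
    # Points never covered by a visible proposal keep a zero vector.
    return acc / np.maximum(norms, 1e-12)
```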
- **Final score between text query and 3D mask**:
\[
s^{\text{CLIP}}_{k,\rho}=\frac{1}{|m^{3D}_k|}\sum_n\cos(F^{\text{CLIP}}\ast m^{3D}_k,e_\rho)
\]
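The score can be read as the mean, over the points \(n\) inside mask \(m^{3D}_k\), of the cosine similarity between each point's feature and the query embedding. The sketch below follows that reading and treats \(e_\rho\) as the text embedding of query \(\rho\), which is an assumption because the symbol is not defined above. The function name and array shapes are hypothetical.

```python
import numpy as np

def mask_text_scores(point_feats, masks, text_embeds):
    """Average per-point cosine similarity inside each mask.

    point_feats: (N, D) per-point features (for example F^CLIP)
    masks:       (K, N) binary 3D proposal masks
    text_embeds: (P, D) one embedding per text query (assumed meaning of e_rho)
    returns:     (K, P) score of every mask against every query
    """
    pf = point_feats / np.maximum(
        np.linalg.norm(point_feats, axis=1, keepdims=True), 1e-12)
    te = text_embeds / np.maximum(
        np.linalg.norm(text_embeds, axis=1, keepdims=True), 1e-12)
    cos = pf @ te.T                      # (N, P) point-to-query cosine
    m = np.asarray(masks, dtype=float)
    sizes = np.maximum(m.sum(axis=1, keepdims=True), 1.0)  # |m_k|
    return (m @ cos) / sizes
```

Taking the argmax over the query axis of the returned matrix would assign each proposal its best-matching text query.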