Abstract: Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.

What problem does this paper attempt to address?

This paper addresses the over-segmentation that existing 3D instance segmentation methods exhibit when dealing with unseen object categories. Current methods usually rely on an unsupervised merging process that generates redundant and inaccurate 3D proposals, which degrades downstream tasks. The paper attributes this to three causes:

1. **Over-segmentation**: Existing 3D instance segmentation methods lift dense 2D instance masks into point clouds to form 3D candidate proposals without direct supervision. The candidates are then hierarchically merged according to heuristic criteria, leaving many redundant fragments that fail to combine into accurate 3D proposals.
2. **Multi-view inconsistency**: 2D segmentation masks are predicted inconsistently across views, which makes the resulting 3D scene segmentation prone to over-segmentation.
3. **High computational cost**: Without principled merging criteria, all possible mask combinations must be retained as a precaution, which increases the computational cost.

To solve these problems, the paper proposes Any3DIS (Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking). Its main contributions are:

- **3D-Aware 2D Mask Tracking module**: Uses robust 3D priors together with a 2D mask segmentation and tracking foundation model (SAM-2) to keep object-level masks consistent across video frames.
- **3D Proposal Refinement via Optimization module**: Selects an optimal set of views with a dynamic programming algorithm and refines the superpoints into the final 3D proposal for each object, which reduces redundant 3D proposals.

With these changes, Any3DIS improves segmentation accuracy while reducing redundant proposals, with gains on the ScanNet200 and ScanNet++ datasets.
Experimental results show that this method performs well in class-agnostic, open-vocabulary, and open-ended 3D instance segmentation tasks.

### Formula Summary

1. **Projection function**:
   \[ \rho_l^t = \Pi(S_l, K, E_t, D_t) \]
   where \(\rho_l^t\) is the set of 2D projection points of the \(l\)-th superpoint in the \(t\)-th frame, \(\Pi\) is the projection function, \(S_l\) is the superpoint, \(K\) is the camera intrinsic matrix, \(E_t\) is the camera extrinsic matrix of frame \(t\), and \(D_t\) is the depth map.
2. **Scaling factor calculation**:
   \[ s_l^t = \frac{\sum_{k \in \mathrm{KNN}(S_l)} |\rho_k^t| / N_{S_k}}{\kappa} \]
   where \(|\rho_k^t|\) is the number of projection points of the neighboring superpoint \(k\) in the \(t\)-th frame, \(N_{S_k}\) is the number of 3D points in superpoint \(k\), and \(\kappa\) is the number of neighboring superpoints.
3. **Final projection point histogram**:
   \[ \psi_l^t = |\rho_l^t| \cdot s_l^t \]
4. **Binary visibility vector**:
   \[ M_{t,l}^q = \begin{cases} 1, & \text{if } \mathrm{IoU}(\rho_l^t, m_t^q) \geq \tau \\ 0, & \text{otherwise} \end{cases} \]
5. **Optimization objective function**:
   \[ L(\theta) = \max_\theta \sum_t \left( \rho_\theta^t \otimes m_t^q - \rho_\theta^t \otimes (1^{H \times W} - m_t^q) \right) \]
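
To make formulas 2 to 4 of the summary above concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: projections \(\rho_l^t\) are represented as sets of pixel coordinates for one frame, \(m_t^q\) is assumed to be a binary 2D mask given as a set of pixels, the neighbor lists stand in for a precomputed KNN, and IoU is taken as plain pixel-set overlap. All function names and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]


def scaling_factor(l: int, proj: Dict[int, Set[Pixel]],
                   num_points: Dict[int, int],
                   neighbors: Dict[int, List[int]]) -> float:
    """s_l^t: mean of |rho_k^t| / N_{S_k} over the kappa neighbors of superpoint l."""
    knn = neighbors[l]  # assumed precomputed; kappa = len(knn)
    return sum(len(proj[k]) / num_points[k] for k in knn) / len(knn)


def weighted_count(l: int, proj: Dict[int, Set[Pixel]],
                   num_points: Dict[int, int],
                   neighbors: Dict[int, List[int]]) -> float:
    """psi_l^t = |rho_l^t| * s_l^t."""
    return len(proj[l]) * scaling_factor(l, proj, num_points, neighbors)


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Pixel-set intersection over union (an assumed reading of IoU here)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def visibility(l: int, proj: Dict[int, Set[Pixel]],
               mask: Set[Pixel], tau: float) -> int:
    """M_{t,l}^q: 1 if the projection of superpoint l overlaps the mask enough."""
    return 1 if iou(proj[l], mask) >= tau else 0


# Toy single-frame data (hypothetical): three superpoints, one object mask.
proj = {
    0: {(0, 0), (0, 1), (1, 0), (1, 1)},  # 4 projected pixels
    1: {(2, 0), (2, 1)},                  # 2 projected pixels
    2: set(),                             # not visible in this frame
}
num_points = {0: 8, 1: 4, 2: 5}           # N_{S_k}: 3D points per superpoint
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
mask = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)}

psi = {l: weighted_count(l, proj, num_points, neighbors) for l in proj}
vis = [visibility(l, proj, mask, tau=0.5) for l in sorted(proj)]
```

In this toy frame, superpoint 0 covers most of the mask and is marked visible, while superpoints 1 and 2 fall below the threshold. The threshold value 0.5 is chosen only for the example.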

Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance

Pass3d: Precise And Accelerated Semantic Segmentation For 3d Point Cloud

SA3DIP: Segment Any 3D Instance with Potential 3D Priors

Open-Ended 3D Point Cloud Instance Segmentation

SAI3D: Segment Any Instance in 3D Scenes

SAM3D: Segment Anything in 3D Scenes

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

ISBNet: a 3D Point Cloud Instance Segmentation Network with Instance-aware Sampling and Box-aware Dynamic Convolution

MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation

SAM-guided Graph Cut for 3D Instance Segmentation

OpenMask3D: Open-Vocabulary 3D Instance Segmentation

OccuSeg: Occupancy-Aware 3D Instance Segmentation

Hierarchical Aggregation for 3D Instance Segmentation

MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation

PointInst3D: Segmenting 3D Instances by Points

Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation

SAM2Point: Segment Any 3D as Videos in Zero-shot and Promptable Manners

MSTA3D: Multi-scale Twin-attention for 3D Instance Segmentation

Part2Object: Hierarchical Unsupervised 3D Instance Segmentation