Any3DIS: Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking

Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, Khoi Nguyen
2024-11-25
Abstract: Existing 3D instance segmentation methods frequently encounter issues with over-segmentation, leading to redundant and inaccurate 3D proposals that complicate downstream tasks. This challenge arises from their unsupervised merging approach, where dense 2D instance masks are lifted across frames into point clouds to form 3D candidate proposals without direct supervision. These candidates are then hierarchically merged based on heuristic criteria, often resulting in numerous redundant segments that fail to combine into precise 3D proposals. To overcome these limitations, we propose a 3D-Aware 2D Mask Tracking module that uses robust 3D priors from a 2D mask segmentation and tracking foundation model (SAM-2) to ensure consistent object masks across video frames. Rather than merging all visible superpoints across views to create a 3D mask, our 3D Mask Optimization module leverages a dynamic programming algorithm to select an optimal set of views, refining the superpoints to produce a final 3D proposal for each object. Our approach achieves comprehensive object coverage within the scene while reducing unnecessary proposals, which could otherwise impair downstream applications. Evaluations on ScanNet200 and ScanNet++ confirm the effectiveness of our method, with improvements across Class-Agnostic, Open-Vocabulary, and Open-Ended 3D Instance Segmentation tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the over-segmentation that existing 3D instance segmentation methods exhibit when handling unseen object categories. Current methods usually rely on an unsupervised merging process, which produces redundant and inaccurate 3D proposals and degrades downstream tasks. The problem stems from three issues:

1. **Over-segmentation**: Existing methods lift dense 2D instance masks into point clouds to form 3D candidate proposals without direct supervision. The candidates are then merged hierarchically according to heuristic criteria, leaving many redundant fragments that cannot be combined into accurate 3D proposals.
2. **Multi-view inconsistency**: 2D segmentation masks are predicted inconsistently across views, which makes the resulting 3D scene segmentation prone to over-segmentation.
3. **High computational cost**: Without reliable merging criteria, all possible mask combinations must be retained to be safe, which increases the computational cost.

To address these issues, the paper proposes Any3DIS (Class-Agnostic 3D Instance Segmentation by 2D Mask Tracking). Its main contributions are:

- **3D-Aware 2D Mask Tracking module**: Uses strong 3D cues together with a 2D mask segmentation and tracking foundation model (SAM-2) to keep object-level masks consistent across video frames.
- **3D Proposal Refinement via Optimization module**: Selects an optimal set of views with a dynamic programming algorithm and refines the superpoints to produce the final 3D proposal for each object, reducing redundant 3D proposals.

With these changes, Any3DIS improves segmentation accuracy while producing fewer redundant proposals, with reported improvements on the ScanNet200 and ScanNet++ datasets.
Experimental results show that the method performs well on class-agnostic, open-vocabulary, and open-ended 3D instance segmentation tasks.

### Formula Summary

1. **Projection function**:
   \[ \rho_l^t = \Pi(S_l, K, E_t, D_t) \]
   where \(\rho_l^t\) is the set of 2D projected points of the \(l\)-th superpoint in the \(t\)-th frame, \(\Pi\) is the projection function, \(S_l\) is the superpoint, \(K\) is the camera intrinsics, \(E_t\) is the camera extrinsics for frame \(t\), and \(D_t\) is the depth map.

2. **Scaling factor calculation**:
   \[ s_l^t = \frac{\sum_{k \in \mathrm{KNN}(S_l)} |\rho_k^t| / N_{S_k}}{\kappa} \]
   where \(|\rho_k^t|\) is the number of projected points of the neighboring superpoint \(k\) in the \(t\)-th frame, \(N_{S_k}\) is the number of 3D points in superpoint \(k\), and \(\kappa\) is the number of neighboring superpoints.

3. **Final projection point histogram**:
   \[ \psi_l^t = |\rho_l^t| \cdot s_l^t \]

4. **Binary visibility vector**:
   \[ M_{t, l}^q = \begin{cases} 1, & \text{if } \text{IoU}(\rho_l^t, m_t^q) \geq \tau \\ 0, & \text{otherwise} \end{cases} \]
   where \(m_t^q\) is the 2D mask of object \(q\) in frame \(t\) and \(\tau\) is the IoU threshold.

5. **Optimization objective function**:
   \[ L(\theta) = \max_\theta \sum_t \left( \rho_\theta^t \otimes m_t^q - \rho_\theta^t \otimes (1^{H \times W} - m_t^q) \right) \]
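
The scaling factor, scaled count, visibility test, and per-frame objective term from the formula summary can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that each projection \(\rho_l^t\) has already been rasterized into a boolean \(H \times W\) image, and it reads \(\otimes\) as an element-wise product summed over pixels. The function names and toy values are hypothetical.

```python
import numpy as np


def scaling_factor(num_projected, num_points, neighbors):
    """Formula 2: average of |rho_k^t| / N_{S_k} over the kappa neighbours."""
    neighbors = np.asarray(neighbors)
    ratios = num_projected[neighbors] / num_points[neighbors]
    return float(ratios.sum() / len(neighbors))


def scaled_count(num_projected_l, s_l):
    """Formula 3: psi_l^t = |rho_l^t| * s_l^t."""
    return num_projected_l * s_l


def mask_iou(a, b):
    """IoU of two boolean H x W images; 0.0 when both are empty."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


def visibility(proj_mask, obj_mask, tau):
    """Formula 4: 1 if IoU(rho_l^t, m_t^q) >= tau, else 0."""
    return int(mask_iou(proj_mask, obj_mask) >= tau)


def frame_score(proj_mask, obj_mask):
    """One term of formula 5 (assumed reading of the product):
    projected pixels inside m_t^q minus projected pixels outside it."""
    inside = np.logical_and(proj_mask, obj_mask).sum()
    outside = np.logical_and(proj_mask, np.logical_not(obj_mask)).sum()
    return int(inside) - int(outside)


# Toy data (hypothetical values, chosen only for illustration).
num_projected = np.array([40, 10, 30])   # |rho_k^t| per superpoint
num_points = np.array([100, 50, 60])     # N_{S_k} per superpoint
s0 = scaling_factor(num_projected, num_points, neighbors=[1, 2])
psi0 = scaled_count(num_projected[0], s0)

obj_mask = np.zeros((4, 4), dtype=bool)
obj_mask[:2, :2] = True                  # m_t^q covers 4 pixels
proj_mask = np.zeros((4, 4), dtype=bool)
proj_mask[:2, :3] = True                 # rasterized rho_l^t covers 6 pixels

print(s0, psi0, visibility(proj_mask, obj_mask, tau=0.5),
      frame_score(proj_mask, obj_mask))
```

Under this reading, a superpoint raises the objective only when more of its projected pixels fall inside the tracked mask than outside it, which is what lets the optimization discard superpoints that mostly project onto the background.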