MSTA3D: Multi-scale Twin-attention for 3D Instance Segmentation

Duc Dang Trung Tran, Byeongkeun Kang, Yeejin Lee
DOI: https://doi.org/10.1145/3664647.3680667
2024-11-05
Abstract: Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often encounter an over-segmentation problem, especially noticeable with large objects. Additionally, unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representation and introduces a twin-attention mechanism to effectively capture them. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on ScanNetV2, ScanNet200 and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper aims to address the issue of over-segmentation in 3D instance segmentation, which is particularly evident when dealing with large objects such as doors, curtains, bookshelves, and backgrounds. Additionally, existing superpoint-based methods have reliability issues when predicting masks. These problems are mainly reflected in the following aspects:

1. **Over-segmentation Issue**: Existing methods tend to over-segment large objects, meaning a single object is incorrectly divided into multiple parts.
2. **Unreliability of Mask Prediction**: During the label conversion from superpoints to points, category grouping may introduce unreliability.
3. **Sparsity and Irregularity of Point Clouds**: Point cloud data usually lacks clear structure, unlike the regular grid arrangement in images, making instance prediction from point clouds more challenging.

To address these challenges, the authors propose a new framework, MSTA3D (Multi-scale Twin-attention for 3D Instance Segmentation). This framework effectively captures features at different scales through multi-scale feature representation and twin-attention mechanisms, and provides complementary spatial constraints through box queries and box regularizers, thereby improving the accuracy of mask prediction.

### Main Contributions

1. **Twin-attention Decoder**: A twin-attention-based decoder is proposed, which can effectively represent multi-scale features and address the over-segmentation issue of large objects and backgrounds.
2. **Box Queries and Box Regularizers**: The concept of box queries and box regularizers is introduced, providing supplementary supervision without additional annotations, enforcing spatial constraints on each instance during the query learning process, thereby enhancing object localization and reducing background noise.
3. **Experimental Validation**: Extensive experiments were conducted on widely used benchmark datasets (such as ScanNetV2, ScanNet200, and S3DIS), demonstrating the effectiveness of the proposed method and achieving state-of-the-art performance.

### Method Overview

The MSTA3D framework mainly includes three key components:

1. **Backbone Network**: Extracts multi-scale features.
2. **Twin-attention Decoder**: Generates instance proposals.
3. **Box Regularizer**: Constrains instance regions.

Through the collaborative work of these components, MSTA3D can effectively address the over-segmentation issue in 3D instance segmentation and improve the accuracy of mask prediction.
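
Two ideas mentioned in this summary can be made concrete with a small sketch: the label conversion from superpoints to points, and box supervision that needs no additional annotations. The code below is illustrative only and is not taken from the paper. It assumes that each point inherits the mask value of its superpoint, and that a box target can be derived as the tightest axis-aligned box around an instance's masked points. The function names `superpoint_mask_to_point_mask` and `axis_aligned_box`, and the toy data, are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner)


def superpoint_mask_to_point_mask(
    superpoint_mask: Sequence[bool], point_to_superpoint: Sequence[int]
) -> List[bool]:
    """Broadcast a per-superpoint mask to points.

    Assumption: every point simply inherits the label of its superpoint.
    """
    return [superpoint_mask[sp] for sp in point_to_superpoint]


def axis_aligned_box(points: Sequence[Point], point_mask: Sequence[bool]) -> Box:
    """Tightest axis-aligned box around the masked points.

    Assumption: a box target like this can be computed from an instance
    mask alone, so no extra box annotation is required.
    """
    selected = [p for p, keep in zip(points, point_mask) if keep]
    if not selected:
        raise ValueError("mask selects no points")
    mins = tuple(min(p[i] for p in selected) for i in range(3))
    maxs = tuple(max(p[i] for p in selected) for i in range(3))
    return mins, maxs


# Toy scene: four points grouped into three superpoints (hypothetical data).
points = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.2), (4.0, 4.0, 1.0), (1.5, 2.0, 0.1)]
point_to_superpoint = [0, 0, 1, 2]

# One instance mask predicted at superpoint level: superpoints 0 and 2.
superpoint_mask = [True, False, True]

point_mask = superpoint_mask_to_point_mask(superpoint_mask, point_to_superpoint)
box = axis_aligned_box(points, point_mask)
print(point_mask)  # [True, True, False, True]
print(box)         # ((0.0, 0.0, 0.0), (1.5, 2.0, 0.2))
```

Under these assumptions, a wrong decision on one superpoint flips the label of every point inside it, which is one way to read the reliability concern raised above. A box constraint gives each instance a coarse spatial extent that can complement the mask.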