Abstract: Recently, transformer-based techniques incorporating superpoints have become prevalent in 3D instance segmentation. However, they often suffer from over-segmentation, which is especially noticeable with large objects. Unreliable mask predictions stemming from superpoint mask prediction further compound this issue. To address these challenges, we propose a novel framework called MSTA3D. It leverages multi-scale feature representations and introduces a twin-attention mechanism to capture these features effectively. Furthermore, MSTA3D integrates a box query with a box regularizer, offering a complementary spatial constraint alongside semantic queries. Experimental evaluations on the ScanNetV2, ScanNet200, and S3DIS datasets demonstrate that our approach surpasses state-of-the-art 3D instance segmentation methods.

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper aims to address the issue of over-segmentation in 3D instance segmentation, which is particularly evident when dealing with large objects such as doors, curtains, bookshelves, and backgrounds. Additionally, existing superpoint-based methods have reliability issues when predicting masks. These problems are mainly reflected in the following aspects:

1. **Over-segmentation Issue**: Existing methods tend to over-segment large objects, meaning a single object is incorrectly divided into multiple parts.
2. **Unreliability of Mask Prediction**: During the label conversion from superpoints to points, category grouping may introduce unreliability.
3. **Sparsity and Irregularity of Point Clouds**: Point cloud data usually lacks clear structure, unlike the regular grid arrangement in images, making instance prediction from point clouds more challenging.

To address these challenges, the authors propose a new framework, MSTA3D (Multi-scale Twin-attention for 3D Instance Segmentation). This framework effectively captures features at different scales through multi-scale feature representation and twin-attention mechanisms, and provides complementary spatial constraints through box queries and box regularizers, thereby improving the accuracy of mask prediction.

### Main Contributions

1. **Twin-attention Decoder**: A twin-attention-based decoder is proposed, which can effectively represent multi-scale features and address the over-segmentation issue of large objects and backgrounds.
2. **Box Queries and Box Regularizers**: The concept of box queries and box regularizers is introduced, providing supplementary supervision without additional annotations, enforcing spatial constraints on each instance during the query learning process, thereby enhancing object localization and reducing background noise.
3. **Experimental Validation**: Extensive experiments were conducted on widely used benchmark datasets (such as ScanNetV2, ScanNet200, and S3DIS), demonstrating the effectiveness of the proposed method and achieving state-of-the-art performance.

### Method Overview

The MSTA3D framework mainly includes three key components:

1. **Backbone Network**: Extracts multi-scale features.
2. **Twin-attention Decoder**: Generates instance proposals.
3. **Box Regularizer**: Constrains instance regions.

Through the collaborative work of these components, MSTA3D can effectively address the over-segmentation issue in 3D instance segmentation and improve the accuracy of mask prediction.
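
Two ideas from the summary can be made concrete with a small sketch: converting a superpoint-level mask into point-level labels, and obtaining box supervision "without additional annotations". The summary does not say how the boxes are obtained. The sketch below assumes they are axis-aligned boxes computed from the existing per-point instance labels, which is one plausible reading rather than the authors' confirmed procedure. The function names `superpoint_mask_to_point_mask` and `boxes_from_instance_labels`, the `ignore_id` parameter, and the data layout are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner), axis-aligned


def superpoint_mask_to_point_mask(
    superpoint_mask: Sequence[bool],
    point_to_superpoint: Sequence[int],
) -> List[bool]:
    """Broadcast a per-superpoint binary mask to every point.

    point_to_superpoint[i] is the index of the superpoint containing point i,
    so each point simply inherits the decision made for its superpoint.
    """
    return [bool(superpoint_mask[sp]) for sp in point_to_superpoint]


def boxes_from_instance_labels(
    points: Sequence[Point],
    instance_ids: Sequence[int],
    ignore_id: int = -1,
) -> Dict[int, Box]:
    """Compute one axis-aligned box per instance from per-point labels.

    Assumption: box targets are derived from instance labels that already
    exist, so no extra annotation is needed. Points labelled ignore_id
    (for example, unlabelled points) are skipped.
    """
    boxes: Dict[int, Box] = {}
    for (x, y, z), inst in zip(points, instance_ids):
        if inst == ignore_id:
            continue
        if inst not in boxes:
            # First point of this instance: the box is a single point.
            boxes[inst] = ((x, y, z), (x, y, z))
        else:
            lo, hi = boxes[inst]
            boxes[inst] = (
                (min(lo[0], x), min(lo[1], y), min(lo[2], z)),
                (max(hi[0], x), max(hi[1], y), max(hi[2], z)),
            )
    return boxes
```

Because every point inherits its superpoint's label, a single wrong superpoint decision affects all points in that superpoint at once. This is one way to see why the summary treats the superpoint-to-point conversion as a source of unreliable masks, and why an independent spatial constraint such as a box could act as a complementary signal.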

MSTA3D: Multi-scale Twin-attention for 3D Instance Segmentation
