Abstract: In 3D understanding, point transformers have yielded significant advances in broadening the receptive field. However, further enlargement of the receptive field is hindered by the constraints of grouping attention. Proxy-based models, an active topic in image and language feature extraction, use global or local proxies to expand the model's receptive field. But global proxy-based methods fail to precisely determine proxy positions and are not suited for tasks like segmentation and detection in point clouds, and existing local proxy-based methods for images face difficulties in global-local balance, proxy sampling across varied point clouds, and parallel cross-attention computation for sparse associations. In this paper, we present SP$^2$T, a local proxy-based dual-stream point transformer, which promotes a global receptive field while maintaining a balance between local and global information. For robust 3D proxy sampling, we propose spatial-wise proxy sampling with vertex-based point-proxy associations, ensuring robust sampling across point clouds of many scales. For economical association computation, we introduce sparse proxy attention combined with table-based relative bias, which enables low-cost and precise interactions between proxy and point features. Comprehensive experiments across multiple datasets reveal that our model achieves SOTA performance in downstream tasks. The code has been released at https://github.com/TerenceWallel/Sparse-Proxy-Point-Transformer

What problem does this paper attempt to address?

This paper addresses the bottleneck that Point Transformer models encounter when expanding the receptive field in 3D point cloud understanding. Specifically, existing methods that rely on the group attention mechanism are limited in how far they can expand the receptive field, especially on large-scale point cloud data. Global proxy methods can expand the receptive field, but they cannot accurately determine proxy positions and are poorly suited to tasks that require local detail, such as segmentation and detection. Existing local proxy methods also struggle with point clouds of different scales, especially in sparse association computation and parallelized cross-attention computation.

To solve these problems, the paper proposes a new local-proxy dual-stream Point Transformer model, SP2T (Sparse Proxy Attention for Dual-stream Point Transformer). The main contributions of this model are:

1. **The SP2T model**: A local-proxy dual-stream architecture that expands the global receptive field while maintaining the balance between global and local information.
2. **A spatial-wise proxy sampling method**: A vertex-based point-proxy association scheme that keeps proxy sampling and association computation robust across point clouds.
3. **The Sparse Proxy Attention (SPA) mechanism**: Combined with Table-Based Relative Bias, it enables low-cost and accurate interaction between proxy and point features.

With these improvements, SP2T achieves state-of-the-art (SOTA) results on multiple datasets in downstream tasks.
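
The vertex-based point-proxy association in contribution 2 can be illustrated with a minimal sketch. It rests on assumptions that the summary does not confirm: proxies are placed on the vertices of a regular grid with a hypothetical cell size `cell`, and each point is associated with the 8 corner vertices of the cell that contains it. The function name `vertex_associations` and its return layout are illustrative only and are not taken from the released code.

```python
import math

def vertex_associations(points, cell):
    """Associate each point with the 8 grid vertices of its enclosing cell.

    Assumption (not from the paper summary): proxies live on the vertices of
    a regular grid with edge length `cell`.
    Returns (proxy_positions, pairs), where each pair is (point_idx, proxy_idx).
    """
    proxy_index = {}  # grid vertex (ix, iy, iz) -> proxy id
    pairs = []        # sparse association list
    for p_idx, p in enumerate(points):
        # Integer coordinates of the cell's lowest corner.
        base = tuple(math.floor(c / cell) for c in p)
        for dx in (0, 1):
            for dy in (0, 1):
                for dz in (0, 1):
                    v = (base[0] + dx, base[1] + dy, base[2] + dz)
                    # Create the proxy on first use, otherwise reuse its id.
                    proxy_id = proxy_index.setdefault(v, len(proxy_index))
                    pairs.append((p_idx, proxy_id))
    # Convert vertex indices back to metric positions.
    proxy_positions = [None] * len(proxy_index)
    for v, i in proxy_index.items():
        proxy_positions[i] = tuple(c * cell for c in v)
    return proxy_positions, pairs

# Two points in the same unit cell share the same 8 proxies.
positions, pairs = vertex_associations([(0.2, 0.2, 0.2), (0.8, 0.8, 0.8)], 1.0)
print(len(positions), len(pairs))  # 8 proxies, 16 associations
```

Under this reading, the number of associations grows linearly with the number of points (8 per point), and only occupied cells create proxies, which is one way such a scheme could stay cheap on sparse point clouds.
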
### Formula Summary

- **Similarity calculation formula of Sparse Proxy Attention (SPA)**:

  \[ S^h_i = \exp\left(\frac{\text{dot}(q^h_{apt_i}, k^h_{apx_i})}{\sqrt{d}}\right) + \text{TRB}^h(ppt_{apt_i} - ppx_{apx_i}) \]

  where $S^h_i$ is the similarity of the $i$-th association of the $h$-th head, $q^h_{apt_i}$ and $k^h_{apx_i}$ are the query and key features respectively, $d$ is the feature dimension, and $\text{TRB}^h$ is the Table-Based Relative Bias.

- **Formula of Table-Based Relative Bias (TRB)**:

  \[ \text{TRB}(x) = \text{TGS}(\text{Trpe}, \text{clamp}(s_{rpe} x, -1, 1)) \]

  where $\text{TGS}$ is the trilinear interpolation function, $x$ is the relative position (offset) between the point and the proxy, $s_{rpe}$ is the scaling factor, and $\text{clamp}$ is the function that restricts the numerical range.

Through these methods, SP2T not only overcomes the limitations of existing methods in receptive field expansion but also improves model performance in 3D point cloud understanding tasks.
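
The two formulas above can be traced with a small pure-Python sketch for a single head and a single association. Several details are assumptions made for illustration: the bias table (Trpe) is taken to be a cubic grid of scalars of size R x R x R, the normalized range [-1, 1] is mapped linearly onto table indices 0 to R-1, and the function names `trilinear_sample`, `trb`, and `spa_similarity` are hypothetical. The similarity follows the formula exactly as written in this summary.

```python
import math

def clamp(v, lo=-1.0, hi=1.0):
    """Restrict v to the range [lo, hi]."""
    return max(lo, min(hi, v))

def trilinear_sample(table, coords):
    """Trilinearly interpolate a cubic table (nested lists, R x R x R, R >= 2)
    at normalized coords in [-1, 1]^3; -1 maps to index 0 and 1 to R - 1."""
    r = len(table)
    idx = [(c + 1.0) * 0.5 * (r - 1) for c in coords]      # continuous index
    lo = [min(int(math.floor(i)), r - 2) for i in idx]     # lower corner
    frac = [i - l for i, l in zip(idx, lo)]                # position in cell
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((frac[0] if dx else 1.0 - frac[0])
                     * (frac[1] if dy else 1.0 - frac[1])
                     * (frac[2] if dz else 1.0 - frac[2]))
                value += w * table[lo[0] + dx][lo[1] + dy][lo[2] + dz]
    return value

def trb(table, offset, s_rpe):
    """TRB(x) = TGS(table, clamp(s_rpe * x, -1, 1)) for a 3D offset x."""
    coords = [clamp(s_rpe * x) for x in offset]
    return trilinear_sample(table, coords)

def spa_similarity(q, k, point_pos, proxy_pos, table, s_rpe):
    """S = exp(dot(q, k) / sqrt(d)) + TRB(point_pos - proxy_pos)."""
    d = len(q)
    dot = sum(a * b for a, b in zip(q, k))
    offset = [p - x for p, x in zip(point_pos, proxy_pos)]
    return math.exp(dot / math.sqrt(d)) + trb(table, offset, s_rpe)

# Toy table whose value equals its first index, so the bias is linear in x.
table = [[[float(i) for _ in range(3)] for _ in range(3)] for i in range(3)]
print(trb(table, (0.5, 0.0, 0.0), 1.0))   # halfway between entries 1 and 2
print(trb(table, (10.0, 0.0, 0.0), 1.0))  # clamped to the table boundary
```

Because the offset is clamped before the lookup, any point-proxy pair farther apart than 1 / s_rpe along an axis reads the boundary entry of the table, which keeps the table small regardless of scene extent.
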

SP$^2$T: Sparse Proxy Attention for Dual-stream Point Transformer

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

Soft Masked Transformer for Point Cloud Processing with Skip Attention-Based Upsampling

3DPCT: 3D Point Cloud Transformer with Dual Self-attention

CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning

Learning Cross-Attention Point Transformer With Global Porous Sampling

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

APPT: Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding

PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation

Stratified Transformer for 3D Point Cloud Segmentation

Local Transformer Network on 3D Point Cloud Semantic Segmentation

MPCT: Multiscale Point Cloud Transformer with a Residual Network

Point Transformer V3: Simpler, Faster, Stronger

PU-Transformer: Point Cloud Upsampling Transformer

ProxyFormer: Proxy Alignment Assisted Point Cloud Completion with Missing Part Sensitive Transformer

Sparse 3D Point Cloud Parallel Multi-Scale Feature Extraction and Dense Reconstruction with Multi-Headed Attentional Upsampling

Few-Shot 3D Point Cloud Semantic Segmentation via Stratified Class-Specific Attention Based Transformer Network

PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture

PointNAT: Large-Scale Point Cloud Semantic Segmentation via Neighbor Aggregation With Transformer

D2T-Net: A dual-domain transformer network exploiting spatial and channel dimensions for semantic segmentation of urban mobile laser scanning point clouds