SP$^2$T: Sparse Proxy Attention for Dual-stream Point Transformer

Jiaxu Wan, Hong Zhang, Ziqi He, Qishu Wang, Ding Yuan, Yifan Yang
2024-12-16
Abstract: In 3D understanding, point transformers have yielded significant advances in broadening the receptive field. However, further enhancement of the receptive field is hindered by the constraints of grouping attention. Proxy-based models, a hot topic in image and language feature extraction, use global or local proxies to expand the model's receptive field. However, global proxy-based methods fail to precisely determine proxy positions and are not suited to point cloud tasks such as segmentation and detection, while existing local proxy-based methods designed for images face difficulties in global-local balance, proxy sampling across varied point clouds, and parallel cross-attention computation for sparse associations. In this paper, we present SP$^2$T, a local proxy-based dual-stream point transformer, which promotes a global receptive field while maintaining a balance between local and global information. To tackle robust 3D proxy sampling, we propose spatial-wise proxy sampling with vertex-based point-proxy associations, ensuring robust sampling for point clouds at many scales. To make association computation economical, we introduce sparse proxy attention combined with a table-based relative bias, which enables low-cost and precise interactions between proxy and point features. Comprehensive experiments across multiple datasets reveal that our model achieves SOTA performance in downstream tasks. The code has been released at https://github.com/TerenceWallel/Sparse-Proxy-Point-Transformer
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the bottleneck that point transformers hit when expanding the receptive field in 3D point cloud understanding. Specifically, grouping attention limits how far the receptive field can grow, especially on large-scale point clouds. Global proxy methods can expand the receptive field, but they cannot accurately determine proxy positions and are poorly suited to tasks that need local detail, such as segmentation and detection. Existing local proxy methods also struggle with point clouds of different scales, particularly in sparse association computation and parallelized cross-attention computation.

To solve these problems, the paper proposes a new local-proxy dual-stream point transformer, SP$^2$T (Sparse Proxy Attention for Dual-stream Point Transformer). The main contributions are:

1. **The SP$^2$T model**: A local-proxy dual-stream architecture that expands the global receptive field while maintaining the balance between global and local information.
2. **Spatial-wise proxy sampling**: A vertex-based point-proxy association method that keeps proxy sampling and association computation robust across point clouds.
3. **Sparse Proxy Attention (SPA)**: Combined with a Table-Based Relative Bias (TRB), it provides low-cost and accurate interaction between proxy and point features.

With these improvements, SP$^2$T is reported to reach state-of-the-art (SOTA) results on multiple datasets in downstream tasks.
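
To make the idea of a vertex-based point-proxy association more concrete, here is a minimal NumPy sketch. The text above does not spell out the sampling rule, so everything in it is an assumption made for illustration: proxies are placed on the vertices of a regular grid with a hypothetical `cell_size`, and each point is linked to the eight vertices of the cell that contains it. The function name `vertex_associations` and the index arrays `a_pt` and `a_px` are likewise illustrative, not the authors' API.

```python
import numpy as np


def vertex_associations(points, cell_size):
    """Link every point to the 8 corner vertices of its enclosing grid cell.

    Assumption (not from the source): proxies live on a regular grid.
    Returns (proxy_xyz, a_pt, a_px), where association i links
    points[a_pt[i]] to proxy_xyz[a_px[i]].
    """
    points = np.asarray(points, dtype=np.float64)
    # Integer coordinates of the lower corner of each point's cell.
    base = np.floor(points / cell_size).astype(np.int64)          # (N, 3)
    # The 8 corner offsets of a unit cell.
    corners = np.array(
        [[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)]
    )                                                             # (8, 3)
    verts = (base[:, None, :] + corners[None, :, :]).reshape(-1, 3)
    # Deduplicate vertices shared by several points; the inverse index
    # maps every (point, corner) pair to its proxy.
    uniq, a_px = np.unique(verts, axis=0, return_inverse=True)
    a_px = a_px.reshape(-1)
    a_pt = np.repeat(np.arange(len(points)), 8)
    return uniq * cell_size, a_pt, a_px


if __name__ == "__main__":
    pts = np.array([[0.1, 0.1, 0.1], [0.9, 0.2, 0.3], [1.5, 0.5, 0.5]])
    proxy_xyz, a_pt, a_px = vertex_associations(pts, cell_size=1.0)
    print(proxy_xyz.shape, a_pt.shape, a_px.shape)
```

Under this assumption the association list is sparse by construction: its length is eight times the number of points, independent of how many proxies exist, which is the kind of structure a sparse attention over `(a_pt, a_px)` pairs can exploit.
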
### Formula Summary

- **Similarity calculation in Sparse Proxy Attention (SPA)**:

  \[ S^h_i = \exp\left(\frac{\text{dot}\left(q^h_{a^{pt}_i}, k^h_{a^{px}_i}\right)}{\sqrt{d}}\right) + \text{TRB}^h\left(p^{pt}_{a^{pt}_i} - p^{px}_{a^{px}_i}\right) \]

  where \( S^h_i \) is the similarity of the \( i \)-th association in the \( h \)-th head, \( q^h_{a^{pt}_i} \) and \( k^h_{a^{px}_i} \) are the query and key features selected by that association, \( d \) is the feature dimension, and \( \text{TRB}^h \) is the Table-Based Relative Bias.

- **Table-Based Relative Bias (TRB)**:

  \[ \text{TRB}(x) = \text{TGS}\left(T_{rpe}, \text{clamp}(s_{rpe}\, x, -1, 1)\right) \]

  where \( \text{TGS} \) is the trilinear interpolation function, \( x \) is the relative position between the point and the proxy (the offset \( p^{pt} - p^{px} \) passed to \( \text{TRB}^h \) in the similarity formula), \( s_{rpe} \) is the scaling factor, and \( \text{clamp} \) restricts its argument to the range \( [-1, 1] \).

Together, these components let SP$^2$T overcome the receptive-field limits of existing methods and improve performance on 3D point cloud understanding tasks.
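
The two formulas above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the bias table is assumed to be an array of shape `(G, G, G, H)` sampled on a regular grid over \( [-1, 1]^3 \), queries are assumed to come from points and keys from proxies, and the names `trb`, `spa_similarity`, `a_pt`, and `a_px` are hypothetical. The similarity follows the formula exactly as written in this summary, with the bias added outside the exponential.

```python
import numpy as np


def trb(table, x, s_rpe):
    """Table-based relative bias: clamp, then trilinear table lookup.

    table: (G, G, G, H) bias values on a grid over [-1, 1]^3 (assumed layout)
    x:     (M, 3) relative positions
    returns (M, H), one bias per association and head
    """
    G = table.shape[0]
    u = np.clip(s_rpe * np.asarray(x, dtype=np.float64), -1.0, 1.0)
    g = (u + 1.0) * 0.5 * (G - 1)              # continuous grid coordinates
    i0 = np.clip(np.floor(g).astype(np.int64), 0, G - 2)
    f = g - i0                                 # fractional offsets in [0, 1]
    out = np.zeros((u.shape[0], table.shape[-1]))
    # Accumulate the 8 corner values with their trilinear weights.
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = f[:, 0] if dx else 1.0 - f[:, 0]
                wy = f[:, 1] if dy else 1.0 - f[:, 1]
                wz = f[:, 2] if dz else 1.0 - f[:, 2]
                corner = table[i0[:, 0] + dx, i0[:, 1] + dy, i0[:, 2] + dz]
                out += (wx * wy * wz)[:, None] * corner
    return out


def spa_similarity(q, k, p_pt, p_px, a_pt, a_px, table, s_rpe):
    """Per-association similarity S[i, h], following the summary's formula.

    q: (N, H, d) point queries, k: (P, H, d) proxy keys (assumed roles)
    p_pt: (N, 3) point positions, p_px: (P, 3) proxy positions
    a_pt, a_px: (M,) index pairs defining the sparse associations
    """
    d = q.shape[-1]
    # Dot product only over associated (point, proxy) pairs, never N x P.
    logits = np.einsum("mhd,mhd->mh", q[a_pt], k[a_px]) / np.sqrt(d)
    bias = trb(table, p_pt[a_pt] - p_px[a_px], s_rpe)
    return np.exp(logits) + bias


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(4, 2, 8)), rng.normal(size=(3, 2, 8))
    p_pt, p_px = rng.uniform(size=(4, 3)), rng.uniform(size=(3, 3))
    a_pt, a_px = np.array([0, 1, 2, 3]), np.array([0, 0, 1, 2])
    table = rng.normal(size=(5, 5, 5, 2))
    print(spa_similarity(q, k, p_pt, p_px, a_pt, a_px, table, 1.0).shape)
```

Gathering `q[a_pt]` and `k[a_px]` before the dot product keeps the cost proportional to the number of associations rather than to the product of point and proxy counts, which is the point of computing attention sparsely.
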