Abstract: Transformer-based models have significantly advanced natural language processing and computer vision in recent years. However, due to the irregular and disordered structure of point cloud data, transformer-based models for 3D deep learning are still in their infancy compared to other methods. In this paper we present Point Cross-Attention Transformer (PointCAT), a novel end-to-end network architecture that uses a cross-attention mechanism for point cloud representation. Our approach combines multi-scale features via two separate cross-attention transformer branches. To reduce the additional computation introduced by the multi-branch structure, we further introduce an efficient model for shape classification, which processes only a single class token from one branch as a query to compute the attention map with the other branch. Extensive experiments demonstrate that our method outperforms or achieves comparable performance to several approaches in shape classification, part segmentation and semantic segmentation tasks.

What problem does this paper attempt to address?

The paper addresses a limitation of existing Transformer-based models for point clouds: because point cloud data is irregular and unordered, these models struggle to capture long-range dependencies and multi-scale features effectively. Specifically, the paper proposes a new dual-branch cross-attention Transformer network architecture (PointCAT), which aims to strengthen point cloud representations by combining position and content features, and thereby to achieve better performance in shape classification, part segmentation and semantic segmentation tasks.

### Main contributions of the paper

1. **Efficient multi-scale feature extraction**: Developed an efficient hierarchical structure for extracting multi-scale feature representations in 3D understanding. This method can learn accurate position information while reducing computational complexity.
2. **Dual-branch cross-attention Transformer architecture**: Proposed a new dual-branch cross-attention Transformer architecture (PointCAT) that can fully combine position and content features at different levels. Because the Transformer is permutation invariant, this architecture is particularly suitable for point cloud learning.
3. **Experimental verification**: Extensive experiments show that this method performs better than or comparably to existing methods across multiple tasks and datasets.

### Technical details of the paper

1. **Multi-scale grouping module**:
   - Obtain sample point groups through farthest point sampling (FPS) and K-nearest neighbor search (KNN).
   - Encode regional details through local aggregation operations and optimize the grouping process through linear geometric shift parameters.
   - Formula representation: \[ F_g=\phi\left(\frac{K(P, F_s)-F_s}{\sigma+\epsilon}\right) \] where \( K \) represents the K-nearest neighbor algorithm, \( \sigma \) is the standard deviation of the channel, \( \epsilon = 1\times10^{-5} \) prevents division by zero, and \( \phi \) represents the linear geometric shift parameter.
2. **Token embedding**:
   - Similar to the [class] token in ViT, add a learnable embedding \( x_{\text{cls}} \) as a point cloud representation token.
   - Formula representation: \[ X = \text{Concat}([x_{\text{cls}}], F)=\{x_{\text{cls}}, x_i \mid i = 1,\ldots,n\}\in\mathbb{R}^{(N + 1)\times 2d} \]
3. **Cross-attention layer**:
   - Achieve feature dimension fusion through the multi-head self-attention mechanism (MSA) and process the features through linear projection and layer normalization (LN).
   - Formula representation: \[ Q = W_q\cdot X'_{l,\text{cls}}, \quad K = W_k\cdot X_l, \quad V = W_v\cdot X_l \] \[ \text{MSA}(Q, K, V)=\text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
4. **Implementation details**:
   - **Shape classification**: Perform classification prediction through multi-scale global representation and an MLP.
   - **Part segmentation**: Segment by propagating features layer by layer and combining global features.
   - **Semantic segmentation**: Utilize the color information of point clouds as an additional modal feature.

### Experimental results

- **Shape classification**: On the ModelNet40 dataset, the overall accuracy of PointCAT reaches 93.5% with 1024 input points, exceeding most existing methods.
- **Part segmentation**: On the ShapeNetPart dataset, PointCAT in the category average intersection-over-union (mIoU) and real
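
The multi-scale grouping step described under the technical details (FPS, then KNN, then the normalized offset \( (K(P, F_s)-F_s)/(\sigma+\epsilon) \)) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The function names, the choice of the first point as the FPS seed, the per-channel computation of \( \sigma \) over all offsets, and the modelling of \( \phi \) as a scalar affine map `alpha * x + beta` are all assumptions made for illustration.

```python
import math
import statistics


def farthest_point_sampling(points, m):
    """Return m indices, each new one farthest from the points already picked."""
    chosen = [0]  # assumption: seed with the first point (a random seed is also common)
    # min_dist[i] is the distance from point i to its nearest chosen point
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen


def knn(points, center, k):
    """Return the indices of the k points nearest to `center` (brute force)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], center))
    return order[:k]


def group_and_normalize(points, feats, m, k, eps=1e-5, alpha=1.0, beta=0.0):
    """Group features around FPS centres and normalize the offsets."""
    centers = farthest_point_sampling(points, m)
    channels = len(feats[0])
    # Offsets K(P, F_s) - F_s: neighbour feature minus the feature of its centre.
    offsets = [
        [[feats[j][c] - feats[s][c] for c in range(channels)]
         for j in knn(points, points[s], k)]
        for s in centers
    ]
    # Assumption: sigma is one standard deviation per channel, over all offsets.
    sigma = [
        statistics.pstdev([row[c] for group in offsets for row in group])
        for c in range(channels)
    ]
    # Hypothetical stand-in for phi: a scalar affine map alpha * x + beta.
    groups = [
        [[alpha * row[c] / (sigma[c] + eps) + beta for c in range(channels)]
         for row in group]
        for group in offsets
    ]
    return centers, groups


# Four points on a line with two feature channels each.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
feats = [[0.0, 1.0], [1.0, 1.0], [2.0, 3.0], [5.0, 0.0]]
centers, groups = group_and_normalize(points, feats, m=2, k=2)
```

Each entry of `groups` holds `k` normalized offset vectors for one sampled centre. Running the function with different `k` values would give the multiple scales the summary refers to, under the same assumptions.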

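The cross-attention formulas in the summary, \( Q = W_q\cdot X'_{\text{cls}} \), \( K = W_k\cdot X_l \), \( V = W_v\cdot X_l \) followed by scaled dot-product softmax attention, can be sketched for a single head as follows. This is a simplified illustration rather than the paper's code: it uses one head instead of multi-head attention, omits layer normalization and the output projection, and all function names are hypothetical. The class token is assumed to come from one branch and the tokens from the other, as the abstract describes.

```python
import math


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_cross_attention(x_cls, tokens, w_q, w_k, w_v):
    """Attend from one branch's class token to the other branch's tokens."""
    q = matvec(w_q, x_cls)                    # Q = W_q . x_cls (a single query)
    keys = [matvec(w_k, t) for t in tokens]   # K = W_k . X_l
    values = [matvec(w_v, t) for t in tokens] # V = W_v . X_l
    d_k = len(q)
    # One score per token, since there is only one query row.
    scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    out = [sum(w * v[c] for w, v in zip(weights, values))
           for c in range(len(values[0]))]
    return out, weights


# Identity projections keep the example easy to follow by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = cls_cross_attention(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], identity, identity, identity
)
```

Because there is a single query, the attention map has one row of length equal to the number of tokens, so the cost grows linearly with the token count instead of quadratically. That is the saving the abstract attributes to using only the class token as the query.
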
PointCAT: Cross-Attention Transformer for point cloud

CAT: Cross Attention in Vision Transformer

3DPCT: 3D Point Cloud Transformer with Dual Self-attention

MPCT: Multiscale Point Cloud Transformer with a Residual Network

CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning

EGCT: Enhanced Graph Convolutional Transformer for 3D Point Cloud Representation Learning

PVT: Point-Voxel Transformer for Point Cloud Learning

Point Cloud Understanding via Attention-Driven Contrastive Learning

3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification

Point Tree Transformer for Point Cloud Registration

Point Cloud Classification Based on Transformer

PatchFormer: an Efficient Point Transformer with Patch Attention

PCT: Point cloud transformer

PointDKT: Dual-Key Transformer for Point Cloud

PU-Transformer: Point Cloud Upsampling Transformer

PCTN: Point Cloud Data Transformation Network

Learning Cross-Attention Point Transformer With Global Porous Sampling

Point Transformer V3: Simpler, Faster, Stronger

Multi-scale Geometry-aware Transformer for 3D Point Cloud Classification

PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture

APPT: Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding