Abstract: In recent years, point cloud analysis methods based on the Transformer architecture have made significant progress, particularly in multimedia applications such as 3D modeling, virtual reality, and autonomous systems. However, the Transformer's high computational cost hinders its scalability, its real-time processing capability, and its deployment on mobile devices and other resource-constrained platforms. This remains a significant obstacle to practical use in scenarios that require on-device intelligence and multimedia processing. To address this challenge, we propose an efficient point cloud analysis architecture, Point MLP-Transformer (PointMT). To avoid the quadratic complexity of self-attention, we introduce a local attention mechanism with linear complexity for effective feature aggregation. Additionally, because the Transformer focuses on differences between tokens while neglecting differences between channels, we introduce a parameter-free channel temperature adaptation mechanism that adaptively adjusts the attention weight distribution in each channel, improving the precision of feature aggregation. To address the Transformer's slow convergence, which stems from the limited scale of point cloud datasets, we propose an MLP-Transformer hybrid module that significantly speeds up convergence. Furthermore, to strengthen the feature representation of point tokens, we refine the classification head so that point tokens participate directly in prediction. Experimental results on multiple evaluation benchmarks demonstrate that PointMT achieves performance comparable to state-of-the-art methods while maintaining a favorable balance between efficiency and accuracy.
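To make the cost argument concrete, the following is a minimal sketch of local attention with a per-channel temperature. It is an illustration under stated assumptions, not the PointMT implementation: the abstract does not give the formulas, so the function name `local_channel_attention`, the similarity score (negative absolute feature difference), and the temperature rule (standard deviation of each channel's scores) are all hypothetical choices. Neighbor indices are assumed to be precomputed. With N points, k neighbors per point, and C channels, the loop does on the order of N·k·C work, which is linear in N for fixed k, whereas global self-attention compares all N² pairs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_channel_attention(feats, neighbors, eps=1e-6):
    """Aggregate each point's features over its local neighborhood.

    feats:     N lists of C floats (one feature vector per point).
    neighbors: N lists of neighbor indices (assumed precomputed, e.g. by kNN).

    Hypothetical design, not taken from the paper: scores are computed
    per channel, and each channel's softmax temperature is the standard
    deviation of that channel's scores, so no learned parameter is needed.
    """
    n_channels = len(feats[0])
    out = []
    for i, nbrs in enumerate(neighbors):
        row = []
        for c in range(n_channels):
            # Per-channel score: neighbors with similar values score higher.
            scores = [-abs(feats[i][c] - feats[j][c]) for j in nbrs]
            mean = sum(scores) / len(scores)
            var = sum((s - mean) ** 2 for s in scores) / len(scores)
            tau = math.sqrt(var) + eps  # assumed parameter-free temperature
            weights = softmax([s / tau for s in scores])
            # Weighted sum of neighbor values in this channel.
            row.append(sum(w * feats[j][c] for w, j in zip(weights, nbrs)))
        out.append(row)
    return out


# Toy usage: four points on a line, two channels, hand-written neighborhoods.
toy_feats = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
toy_neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
toy_out = local_channel_attention(toy_feats, toy_neighbors)
```

Because each channel gets its own softmax, the weights can be sharp in a channel where neighbors differ strongly and nearly uniform in a channel where they are alike, which is the general idea behind adapting the attention distribution per channel.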

PVT: Point-Voxel Transformer for 3D Deep Learning

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

PVT: Point-Voxel Transformer for Point Cloud Learning

VTPNet for 3D deep learning on point cloud

Regional-to-Local Point-Voxel Transformer for Large-Scale Indoor 3D Point Cloud Semantic Segmentation

PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection

Point-Voxel CNN for Efficient 3D Deep Learning

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

Multi Point-Voxel Convolution (MPVConv) for Deep Learning on Point Clouds

Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding

Multi Voxel-Point Neurons Convolution (MVPConv) for Fast and Accurate 3D Deep Learning

DVST: Deformable Voxel Set Transformer for 3D Object Detection from Point Clouds

3DPCT: 3D Point Cloud Transformer with Dual Self-attention

Point Transformer V3: Simpler, Faster, Stronger

Stratified Transformer for 3D Point Cloud Segmentation

CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance

PointCAT: Cross-Attention Transformer for point cloud

3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification

PointMT: Efficient Point Cloud Analysis with Hybrid MLP-Transformer Architecture

VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and Stereo Data Fusion