Point Cloud Recognition with Position-to-Structure Attention Transformers

Zheng Ding, James Hou, Zhuowen Tu
DOI: https://doi.org/10.48550/arXiv.2210.02030
2022-10-05
Abstract: In this paper, we present Position-to-Structure Attention Transformers (PS-Former), a Transformer-based algorithm for 3D point cloud recognition. PS-Former deals with the challenge in 3D point cloud representation where points are not positioned in a fixed grid structure and have limited feature description (only 3D coordinates ($x, y, z$) for scattered points). Existing Transformer-based architectures in this domain often require a pre-specified feature engineering step to extract point features. Here, we introduce two new aspects in PS-Former: 1) a learnable condensation layer that performs point downsampling and feature extraction; and 2) a Position-to-Structure Attention mechanism that recursively enriches the structural information with the position attention branch. Compared with the competing methods, while being generic with less heuristics feature designs, PS-Former demonstrates competitive experimental results on three 3D point cloud tasks including classification, part segmentation, and scene segmentation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to extract useful structural information from unordered, scattered point sets with limited feature descriptions in 3D point cloud recognition. It identifies two main challenges for current Transformer-based methods on 3D point cloud data:

1. **Unordered point cloud data**: Unlike voxel-based representations, a point cloud consists of scattered points with no fixed order or grid, so traditional 3D convolutions cannot be applied directly.
2. **Limited feature descriptions of points**: Each point carries only its 3D coordinates (\(x, y, z\)) and no richer explicit features. Understanding the overall shape and the object parts represented by the point cloud therefore depends in part on extracting "correct" and "useful" features from the relationship (context) between a point and its neighboring points.

To address these challenges, the paper proposes **Position-to-Structure Attention Transformers (PS-Former)**, which introduces two components:

1. **Learnable Condensation Layer**: Performs point cloud downsampling and feature extraction in a learned manner. Unlike the Transformer-based PCT method, PS-Former uses the internal self-attention matrix to compute the relationships between points and thereby extract structural features.
2. **Position-to-Structure Attention Mechanism**: Recursively enriches structural information through the position attention branch. This differs from standard cross-attention, in which the two working branches attend to each other symmetrically.
Through these designs, PS-Former can learn representations of 3D point clouds without relying on a pre-specified feature engineering step, and it achieves competitive results on three 3D point cloud tasks: classification, part segmentation, and scene segmentation.
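The asymmetry between the two branches can be made concrete with a small sketch. The code below is not the authors' implementation: it is one possible reading of "position-to-structure" attention, in which the attention weights are computed from position features only and are then used to update the structure features. The function name `position_to_structure_attention`, the random projection matrices, the feature sizes, and the residual update are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_to_structure_attention(pos_feat, struct_feat, w_q, w_k, w_v):
    """Hypothetical one-directional attention (an assumption, not the paper's code).

    Queries and keys come from the position branch; values come from the
    structure branch. The structure features never influence the weights,
    which is what makes this asymmetric, unlike symmetric cross-attention.
    """
    q = pos_feat @ w_q                      # (n, d)
    k = pos_feat @ w_k                      # (n, d)
    v = struct_feat @ w_v                   # (n, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n, n), rows sum to 1
    return struct_feat + attn @ v, attn     # residual update of the structure branch

rng = np.random.default_rng(0)
n, d = 8, 16                                # 8 points, 16-dim features (arbitrary)
points = rng.normal(size=(n, 3))            # raw (x, y, z) coordinates

# Assumed linear embeddings of the coordinates for the two branches.
pos_feat = points @ (rng.normal(size=(3, d)) * 0.1)
struct_feat = points @ (rng.normal(size=(3, d)) * 0.1)

w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# "Recursive" enrichment: apply the same update for a few rounds.
for _ in range(2):
    struct_feat, attn = position_to_structure_attention(
        pos_feat, struct_feat, w_q, w_k, w_v
    )

print(attn.shape, struct_feat.shape)
```

In this sketch, rescaling or replacing `struct_feat` leaves `attn` unchanged, because the weights depend only on `pos_feat`. A symmetric cross-attention design would instead let each branch form queries against the other.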