Abstract: Dynamic 3D point cloud sequences serve as one of the most common and practical representation modalities of dynamic real-world environments. However, their unstructured nature in both spatial and temporal domains poses significant challenges to effective and efficient processing. Existing deep point cloud sequence modeling approaches imitate the mature 2D video learning mechanisms by developing complex spatio-temporal point neighbor grouping and feature aggregation schemes, often resulting in methods lacking effectiveness, efficiency, and expressive power. In this paper, we propose a novel generic representation called \textit{Structured Point Cloud Videos} (SPCVs). Intuitively, by leveraging the fact that 3D geometric shapes are essentially 2D manifolds, SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points. The structured nature of our SPCV representation allows for the seamless adaptation of well-established 2D image/video techniques, enabling efficient and effective processing and analysis of 3D point cloud sequences. To achieve such re-organization, we design a self-supervised learning pipeline that is geometrically regularized and driven by self-reconstructive and deformation field learning objectives. Additionally, we construct SPCV-based frameworks for both low-level and high-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and compression. Extensive experiments demonstrate the versatility and superiority of the proposed SPCV, which has the potential to offer new possibilities for deep learning on unstructured 3D point cloud sequences. Code will be released at https://github.com/ZENGYIMING-EAMON/SPCV.
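
To make the data layout concrete, the following is a minimal sketch of what an SPCV-like structure could look like in code. It is not the authors' pipeline: the abstract describes a learned, self-supervised re-organization, whereas this sketch assumes a hand-written parameterization of a pulsating sphere as a stand-in. The function names `toy_spcv`, `frame_to_points`, and `interpolate_frames` are hypothetical and chosen only for illustration. The sketch shows the two properties the abstract relies on: each pixel stores a 3D coordinate, and the same pixel tracks the same surface location across frames.

```python
import math

def toy_spcv(num_frames=4, height=8, width=16):
    """Build a toy SPCV-like grid: frames[t][i][j] is an (x, y, z) point.

    Assumption: a fixed (theta, phi) parameterization of a growing sphere
    stands in for the learned re-organization described in the abstract.
    """
    frames = []
    for t in range(num_frames):
        radius = 1.0 + 0.1 * t  # the shape grows slightly over time
        frame = []
        for i in range(height):
            theta = math.pi * (i + 0.5) / height  # polar angle
            row = []
            for j in range(width):
                phi = 2.0 * math.pi * j / width  # azimuth
                row.append((radius * math.sin(theta) * math.cos(phi),
                            radius * math.sin(theta) * math.sin(phi),
                            radius * math.cos(theta)))
            frame.append(row)
        frames.append(frame)
    return frames

def frame_to_points(frame):
    """Flatten one H x W frame back into a plain list of 3D points."""
    return [point for row in frame for point in row]

def interpolate_frames(frame_a, frame_b, alpha):
    """Blend two frames pixel by pixel.

    This is only meaningful under the assumed temporal consistency:
    pixel (i, j) must refer to the same surface location in both frames.
    """
    return [[tuple((1 - alpha) * a + alpha * b for a, b in zip(pa, pb))
             for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]

frames = toy_spcv()
midpoint_frame = interpolate_frames(frames[0], frames[1], 0.5)
points = frame_to_points(midpoint_frame)
```

Because every frame is a regular grid, an intermediate frame can be obtained here by a simple per-pixel blend, and each frame can still be flattened back into an ordinary point list. On raw, unordered point clouds the same blend would first require establishing point correspondences.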

PointSDA: Spatio-temporal Deformable Attention Network for Point Cloud Video Modeling

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

SCA-Net: Spatial and channel attention-based network for 3D point clouds

Point Attention Network for Point Cloud Semantic Segmentation

Point Attention Network for Semantic Segmentation of 3D Point Clouds

Deep Hierarchical Representation of Point Cloud Videos via Spatio-Temporal Decomposition

SDANet: spatial deep attention-based for point cloud classification and segmentation

ASAP-Net: Attention and Structure Aware Point Cloud Sequence Segmentation

Point Spatio-Temporal Transformer Networks for Point Cloud Video Modeling

Point Deformable Network with Enhanced Normal Embedding for Point Cloud Analysis

Spatio-Temporal Deformable Attention Network for Video Deblurring

Dynamic 3D Point Cloud Sequences as 2D Videos

On Exploring PDE Modeling for Point Cloud Video Representation Learning

Spatial deformable transformer for 3D point cloud registration

CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning

DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

PAV-Net: Point-wise Attention Keypoints Voting Network for Real-time 6D Object Pose Estimation

SCA-PVNet: Self-and-cross attention based aggregation of point cloud and multi-view for 3D object retrieval