PointSDA: Spatio-temporal Deformable Attention Network for Point Cloud Video Modeling

Xiaoxiao Sheng, Zhiqiang Shen, Gang Xiao
DOI: https://doi.org/10.1109/lra.2024.3477303
IF: 5.2
2024-01-01
IEEE Robotics and Automation Letters
Abstract: Point cloud videos consist of sequences of point clouds captured continuously by an agent. Modeling them is important for real-world action analysis and dynamic scene perception. Aggregating multi-level features benefits these downstream tasks, yet previous point cloud video networks have overlooked it. In this paper, we propose a spatio-temporal deformable attention network (PointSDA) to aggregate multi-level features of point cloud videos. Specifically, we develop a point deformable attention mechanism that adaptively focuses on relevant regions with more informative features. From the query, the mechanism predicts a sparse set of key positions as offsets and computes the corresponding attention scores under the guidance of those offsets. These scores then weight an attention-based aggregation over values obtained by point interpolation. Furthermore, point deformable attention is applied to fuse multi-level features in a spatio-temporally decoupled manner. In this way, hierarchical information is captured to facilitate action recognition and dynamic semantic segmentation on point cloud videos. Extensive experiments and ablation studies on multiple benchmark datasets demonstrate the effectiveness of our method.
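The point deformable attention step described in the abstract (offset prediction from the query, attention scores, interpolated values, weighted aggregation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear projections `W_off` and `W_attn`, the inverse-distance interpolation over the 3 nearest neighbors, and the choice to compute scores from the query features alone are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interpolate_features(sample_xyz, support_xyz, support_feat, k=3, eps=1e-8):
    # Assumed interpolation scheme: inverse squared-distance weighting over
    # the k nearest support points.
    # sample_xyz: (S, 3), support_xyz: (N, 3), support_feat: (N, C)
    d2 = ((sample_xyz[:, None, :] - support_xyz[None, :, :]) ** 2).sum(-1)  # (S, N)
    idx = np.argsort(d2, axis=1)[:, :k]                                     # (S, k)
    near = np.take_along_axis(d2, idx, axis=1)                              # (S, k)
    w = 1.0 / (near + eps)
    w = w / w.sum(axis=1, keepdims=True)                                    # rows sum to 1
    return (w[..., None] * support_feat[idx]).sum(axis=1)                   # (S, C)

def point_deformable_attention(q_xyz, q_feat, s_xyz, s_feat, W_off, W_attn, num_keys):
    # q_xyz: (M, 3) query positions, q_feat: (M, C) query features
    # s_xyz: (N, 3) support positions, s_feat: (N, C) support features
    # W_off: (C, num_keys * 3) and W_attn: (C, num_keys) are hypothetical
    # learned projections, here supplied by the caller.
    M, C = q_feat.shape
    # 1. Predict a sparse set of key positions as offsets from each query point.
    offsets = (q_feat @ W_off).reshape(M, num_keys, 3)
    sample_xyz = q_xyz[:, None, :] + offsets                                # (M, K, 3)
    # 2. Obtain values at those positions by point interpolation.
    values = interpolate_features(sample_xyz.reshape(-1, 3), s_xyz, s_feat)
    values = values.reshape(M, num_keys, C)                                 # (M, K, C)
    # 3. Attention scores over the K sampled keys (assumed: from the query only).
    attn = softmax(q_feat @ W_attn, axis=-1)                                # (M, K)
    # 4. Attention-weighted aggregation of the interpolated values.
    return (attn[..., None] * values).sum(axis=1)                           # (M, C)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, N, C, K = 4, 32, 8, 5
    out = point_deformable_attention(
        rng.normal(size=(M, 3)), rng.normal(size=(M, C)),
        rng.normal(size=(N, 3)), rng.normal(size=(N, C)),
        0.1 * rng.normal(size=(C, K * 3)), rng.normal(size=(C, K)), K,
    )
    print(out.shape)
```

In this sketch the query and support sets may come from different feature levels, which is one way such a module could fuse multi-level features; how PointSDA decouples the spatial and temporal fusion is not specified in the abstract and is not modeled here.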