FESTA: Flow Estimation via Spatial-Temporal Attention for Scene Point Clouds

Haiyan Wang, Jiahao Pang, Muhammad A. Lodhi, Yingli Tian, Dong Tian
DOI: https://doi.org/10.48550/arXiv.2104.00798
2021-12-07
Abstract: Scene flow depicts the dynamics of a 3D scene, which is critical for various applications such as autonomous driving, robot navigation, AR/VR, etc. Conventionally, scene flow is estimated from dense/regular RGB video frames. With the development of depth-sensing technologies, precise 3D measurements are available via point clouds which have sparked new research in 3D scene flow. Nevertheless, it remains challenging to extract scene flow from point clouds due to the sparsity and irregularity in typical point cloud sampling patterns. One major issue related to irregular sampling is identified as the randomness during point set abstraction/feature extraction -- an elementary process in many flow estimation scenarios. A novel Spatial Abstraction with Attention (SA^2) layer is accordingly proposed to alleviate the unstable abstraction problem. Moreover, a Temporal Abstraction with Attention (TA^2) layer is proposed to rectify attention in temporal domain, leading to benefits with motions scaled in a larger range. Extensive analysis and experiments verified the motivation and significant performance gains of our method, dubbed as Flow Estimation via Spatial-Temporal Attention (FESTA), when compared to several state-of-the-art benchmarks of scene flow estimation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of estimating scene flow from point cloud data. Traditional scene flow estimation relies mainly on dense, regular RGB video frames. Depth-sensing technology now provides accurate 3D measurements in the form of point clouds, which has opened a new research direction in 3D scene flow. Extracting scene flow from point clouds remains challenging, however, because typical point cloud sampling patterns are sparse and irregular. In particular, point set abstraction/feature extraction, a fundamental step in many flow estimation pipelines, involves randomness that makes the abstraction unstable.

To this end, the paper proposes two layers:

1. **Spatial Abstraction with Attention (SA^2) layer**: It aims to alleviate the unstable abstraction problem. By introducing a trainable Aggregation Pooling (AP) module, the SA^2 layer generates more stable down-sampled points and thereby defines more stable regions of attention.
2. **Temporal Abstraction with Attention (TA^2) layer**: It rectifies attention in the temporal domain so that motions over a larger range of scales are handled better. Using an initial scene flow estimate, the TA^2 layer shifts the temporal regions of attention toward positions where the corresponding points are more likely to lie.

With these two layers, the proposed method (called FESTA) shows significant performance gains on multiple benchmarks, covering both synthetic and real-world scene flow estimation tasks.

### Main Contributions

1. **Proposing the SA^2 layer**: It achieves stable point cloud abstraction, generating points whose positions remain invariant regardless of how the point cloud is sampled from the scene manifold, and thereby defines stable regions of attention. The effectiveness of the SA^2 layer is verified both theoretically and empirically.
2. **Proposing the TA^2 layer**: It can estimate both small-scale and large-scale motions by emphasizing the regions where good matches are more likely to be found, regardless of the scale of the motion.
3. **FESTA architecture**: On synthetic and real-world benchmarks, the FESTA architecture achieves state-of-the-art performance in 3D point cloud scene flow estimation, significantly outperforming existing scene flow estimation methods.

### Experimental Verification

The paper verifies the stability of the SA^2 layer and the overall performance of the FESTA architecture through a series of experiments. The results show that the SA^2 layer produces significantly more stable point cloud abstractions than the traditional Farthest Point Sampling (FPS) method, especially when the point cloud sampling density is high. In addition, the FESTA architecture performs well on multiple benchmarks, for both large-scale and small-scale motions.

### Conclusion

By introducing the SA^2 and TA^2 layers, this paper addresses a key difficulty in estimating scene flow from point clouds and provides technical support for fields such as autonomous driving, robot navigation, and AR/VR.
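
To make the unstable abstraction problem concrete, the following is a minimal NumPy sketch of plain Farthest Point Sampling, the baseline the summary compares against. It is not the paper's code and does not implement the SA^2 layer or its Aggregation Pooling module. The helper names (`farthest_point_sampling`, `mean_nearest_distance`, `sample_sphere`), the unit sphere used as a stand-in scene, and the point counts are all assumptions chosen for illustration.

```python
import numpy as np


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return indices of k points, each farthest from those already picked."""
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = np.argmax(min_dist)
        dist_to_new = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


def mean_nearest_distance(a, b):
    """Average distance from each point in a to its nearest neighbour in b."""
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()


def sample_sphere(n, rng):
    """Hypothetical 'scene': n random points on the unit sphere."""
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


rng = np.random.default_rng(0)
cloud_a = sample_sphere(2048, rng)  # first scan of the surface
cloud_b = sample_sphere(2048, rng)  # second, independent scan of the same surface

down_a = cloud_a[farthest_point_sampling(cloud_a, 64)]
down_b = cloud_b[farthest_point_sampling(cloud_b, 64)]

# A nonzero value means the abstracted points moved although the surface did not.
print("mismatch between down-sampled sets:", mean_nearest_distance(down_a, down_b))
```

Because FPS can only return a subset of its input, two independent samplings of the same static surface yield different down-sampled points, and so different neighbourhoods for any later feature grouping. According to the summary above, the SA^2 layer targets this by replacing the hard selection with a trainable aggregation step, which this sketch leaves out.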