Triplet Attention Transformer for Spatiotemporal Predictive Learning

Xuesong Nie, Xi Chen, Haoyuan Jin, Zhihang Zhu, Yunfeng Yan, Donglian Qi
2023-10-28
Abstract: Spatiotemporal predictive learning offers a self-supervised learning paradigm that enables models to learn both spatial and temporal patterns by predicting future sequences based on historical sequences. Mainstream methods are dominated by recurrent units, yet they are limited by their lack of parallelization and often underperform in real-world scenarios. To improve prediction quality while maintaining computational efficiency, we propose an innovative triplet attention transformer designed to capture both inter-frame dynamics and intra-frame static features. Specifically, the model incorporates the Triplet Attention Module (TAM), which replaces traditional recurrent units by exploring self-attention mechanisms in temporal, spatial, and channel dimensions. In this configuration: (i) temporal tokens contain abstract representations of inter-frame, facilitating the capture of inherent temporal dependencies; (ii) spatial and channel attention combine to refine the intra-frame representation by performing fine-grained interactions across spatial and channel dimensions. Alternating temporal, spatial, and channel-level attention allows our approach to learn more complex short- and long-range spatiotemporal dependencies. Extensive experiments demonstrate performance surpassing existing recurrent-based and recurrent-free methods, achieving state-of-the-art under multi-scenario examination including moving object trajectory prediction, traffic flow prediction, driving scene prediction, and human motion capture.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses spatiotemporal predictive learning, with the goal of improving prediction quality while maintaining computational efficiency. It targets the limitations of the two existing mainstream families of methods:

1. **Recurrent unit-based methods**: These methods dominate spatiotemporal prediction tasks because of their advantages in temporal modeling. However, they cannot process frames in parallel and often perform poorly in real-world scenarios.
2. **Non-recurrent methods**: Although these methods are more computationally efficient, they still lag behind recurrent methods in certain scenarios, especially in robust modeling of inter-frame and intra-frame changes.

To address these issues, the paper proposes the Triplet Attention Transformer, a parallelized, pure-attention framework that aims to learn complex short-range and long-range spatiotemporal dependencies while maintaining computational efficiency. The core of this approach is the Triplet Attention Module (TAM), which replaces recurrent units and combines three types of attention:

- **Temporal Attention**: Captures dynamics between frames.
- **Spatial Attention**: Refines intra-frame representations through fine-grained interactions in the spatial dimension.
- **Channel Attention**: Also operates within frames, further refining representations through interactions in the channel dimension.

By alternating these three attention mechanisms, the model can learn complex spatiotemporal patterns more effectively. According to the reported experiments, the method outperforms existing recurrent and non-recurrent methods across several scenarios (moving object trajectory prediction, traffic flow prediction, driving scene prediction, and human motion capture) and achieves state-of-the-art performance on multiple datasets.
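
To make the alternation of temporal, spatial, and channel attention concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`self_attention`, `triplet_attention_block`), the `(T, N, C)` token layout, the identity query/key/value projections, and the plain residual connections are all hypothetical simplifications. The actual TAM presumably uses learned projections, normalization, and feed-forward layers that this summary does not describe.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Scaled dot-product self-attention over the second-to-last axis.

    tokens: array of shape (..., L, D), i.e. L tokens of dimension D.
    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), to keep the sketch short.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens                    # (..., L, D)

def triplet_attention_block(x):
    """Apply temporal, then spatial, then channel attention, each with a residual.

    x: array of shape (T, N, C) = (frames, spatial positions, channels).
    """
    # Temporal: for each spatial position, the T frames are the tokens.
    xt = np.swapaxes(x, 0, 1)                       # (N, T, C)
    x = x + np.swapaxes(self_attention(xt), 0, 1)   # back to (T, N, C)

    # Spatial: within each frame, the N positions are the tokens.
    x = x + self_attention(x)

    # Channel: within each frame, the C channels are the tokens.
    xc = np.swapaxes(x, 1, 2)                       # (T, C, N)
    x = x + np.swapaxes(self_attention(xc), 1, 2)   # back to (T, N, C)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((4, 16, 8))  # 4 frames, 16 positions, 8 channels
    out = triplet_attention_block(frames)
    print(out.shape)
```

Because every step is a batched matrix product over all frames at once, nothing in the block depends on processing frames sequentially, which is the parallelization advantage over recurrent units that the summary highlights.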