Abstract: Spatiotemporal predictive learning offers a self-supervised learning paradigm that enables models to learn both spatial and temporal patterns by predicting future sequences based on historical sequences. Mainstream methods are dominated by recurrent units, yet they are limited by their lack of parallelization and often underperform in real-world scenarios. To improve prediction quality while maintaining computational efficiency, we propose an innovative triplet attention transformer designed to capture both inter-frame dynamics and intra-frame static features. Specifically, the model incorporates the Triplet Attention Module (TAM), which replaces traditional recurrent units by exploring self-attention mechanisms in the temporal, spatial, and channel dimensions. In this configuration: (i) temporal tokens contain abstract inter-frame representations, facilitating the capture of inherent temporal dependencies; (ii) spatial and channel attention combine to refine the intra-frame representation by performing fine-grained interactions across the spatial and channel dimensions. Alternating temporal, spatial, and channel-level attention allows our approach to learn more complex short- and long-range spatiotemporal dependencies. Extensive experiments demonstrate performance surpassing existing recurrent-based and recurrent-free methods, achieving state-of-the-art results across multiple scenarios, including moving object trajectory prediction, traffic flow prediction, driving scene prediction, and human motion capture.

What problem does this paper attempt to address?

The paper addresses two goals in spatiotemporal predictive learning: improving prediction quality and maintaining computational efficiency. It proposes a new solution to the limitations of the two existing mainstream families of methods:

1. **Recurrent unit-based methods**: These methods dominate spatiotemporal prediction tasks because of their advantages in temporal modeling. However, they cannot process frames in parallel and often perform poorly in real-world scenarios.
2. **Non-recurrent methods**: Although these methods are more computationally efficient, they still lag behind recurrent methods in certain scenarios, especially in robust modeling of inter-frame and intra-frame changes.

To address these issues, the paper proposes the Triplet Attention Transformer, which aims to learn complex short-range and long-range spatiotemporal dependencies through a parallelized, pure-attention framework while maintaining computational efficiency. The core of this approach is the Triplet Attention Module (TAM), which combines three types of attention mechanisms:

- **Temporal Attention**: Captures dynamics between frames.
- **Spatial Attention**: Refines intra-frame representations through fine-grained interactions in the spatial dimension.
- **Channel Attention**: Also operates within frames, further refining representations through interactions in the channel dimension.

By alternating these three attention mechanisms, the model can learn complex spatiotemporal patterns more effectively. Experimental results show that this method outperforms existing recurrent and non-recurrent methods in various scenarios (moving object trajectory prediction, traffic flow prediction, driving scene prediction, and human motion capture) and achieves state-of-the-art performance on multiple datasets.
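The idea of alternating temporal, spatial, and channel attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a feature tensor of shape `(T, N, C)` (frames, spatial tokens, channels), uses identity query/key/value projections in place of learned ones, and omits normalization and feed-forward layers. The names `self_attention` and `triplet_attention_block` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Scaled dot-product self-attention over the second-to-last axis.

    tokens: array of shape (..., L, D). Assumption: Q = K = V = tokens
    (no learned projections), to keep the sketch short.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens                    # (..., L, D)

def triplet_attention_block(x):
    """Apply temporal, then spatial, then channel attention with residuals.

    x: array of shape (T, N, C).
    """
    # Temporal: for each spatial position, tokens are the T frames.
    xt = np.swapaxes(x, 0, 1)                          # (N, T, C)
    x = x + np.swapaxes(self_attention(xt), 0, 1)      # back to (T, N, C)

    # Spatial: within each frame, tokens are the N spatial positions.
    x = x + self_attention(x)                          # (T, N, C)

    # Channel: within each frame, tokens are the C channels.
    xc = np.swapaxes(x, 1, 2)                          # (T, C, N)
    x = x + np.swapaxes(self_attention(xc), 1, 2)      # back to (T, N, C)
    return x

# Usage: 4 frames, 6 spatial tokens, 8 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8))
y = triplet_attention_block(x)
```

Each step only changes which axis plays the role of the token sequence, so all frames and positions are processed in parallel rather than step by step as in a recurrent unit.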

Triplet Attention Transformer for Spatiotemporal Predictive Learning

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

Enhancing spatiotemporal predictive learning: an approach with nested attention module

A Spatio-Temporal Transformer Network for Human Motion Prediction in Human-Robot Collaboration

Revisiting the Temporal Modeling in Spatio-Temporal Predictive Learning under A Unified View

Self-Attention ConvLSTM for Spatiotemporal Prediction

Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations

Spatial-Channel Transformer Network for Trajectory Prediction on the Traffic Scenes

Hybrid Transformer and Spatial-Temporal Self-Supervised Learning for Long-term Traffic Prediction

Spatiotemporal Attention for Multivariate Time Series Prediction and Interpretation

Rethinking Spatio-Temporal Transformer for Traffic Prediction: Multi-level Multi-view Augmented Learning Framework

PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs

NAST: Non-Autoregressive Spatial-Temporal Transformer for Time Series Forecasting

HSTA: A Hierarchical Spatio-Temporal Attention Model for Trajectory Prediction

GSSTU: Generative Spatial Self-Attention Transformer Unit for Enhanced Video Prediction

PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning

Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Space or time for video classification transformers

Spatial linear transformer and temporal convolution network for traffic flow prediction

ASTM - an Attention Based Spatiotemporal Model for Video Prediction Using 3D Convolutional Neural Networks