Abstract: Spatiotemporal predictive learning is a deep learning approach that generates future frames from historical frames in a self-supervised manner. Existing methods struggle to capture long-term dependencies and to produce accurate predictions over extended time horizons. To address these limitations, this paper introduces a nested attention module, an attention mechanism designed to capture the spatiotemporal correlations of the input historical frames. The nested attention module decomposes temporal attention into inter-frame channel attention and spatiotemporal attention and uses a nested attention mechanism to capture long-term temporal dependencies, which improves the model's performance and generalization ability. Furthermore, to prevent overfitting, a new regularization method is proposed that considers both the intra-frame spatial error and the inter-frame temporal evolution error of sequence frames and enhances the robustness of the predictive model to dropout operations. The proposed model achieves state-of-the-art performance on four benchmark datasets: Moving MNIST (handwritten digits), Human3.6M, a sea surface temperature dataset, and KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute). Extended experiments demonstrate the generalization and extensibility of the nested attention module on real-world datasets. On Moving MNIST, the model reduces mean squared error by 31.7% and mean absolute error by 26.9% when predicting 10 frames. The proposed model provides a new baseline for future research in spatiotemporal predictive learning tasks.
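
The abstract does not give the formula for the proposed regularization, so the following is only a minimal sketch of one way an objective could combine an intra-frame spatial error with an inter-frame temporal evolution error. The function name `spatiotemporal_loss`, the use of frame differences for the temporal term, the squared-error form, and the `temporal_weight` parameter are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def spatiotemporal_loss(pred, target, temporal_weight=0.1):
    """Illustrative loss on frame sequences of shape (T, C, H, W).

    Hypothetical formulation (not taken from the paper):
      spatial term  = MSE between predicted and true frames
      temporal term = MSE between predicted and true frame-to-frame changes
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")

    # Intra-frame spatial error: per-pixel squared error, averaged over all frames.
    spatial = np.mean((pred - target) ** 2)

    # Inter-frame temporal evolution error: compare how consecutive frames change.
    if pred.shape[0] < 2:
        temporal = 0.0  # a single frame has no temporal evolution
    else:
        d_pred = np.diff(pred, axis=0)
        d_target = np.diff(target, axis=0)
        temporal = np.mean((d_pred - d_target) ** 2)

    return spatial + temporal_weight * temporal

# Usage: a two-frame toy sequence where the prediction misses all motion.
target = np.stack([np.zeros((1, 2, 2)), np.ones((1, 2, 2))])
pred = np.zeros_like(target)
loss = spatiotemporal_loss(pred, target)
```

In this sketch the temporal term penalizes predictions that match individual frames on average but get the motion between frames wrong; the weighting between the two terms is a free choice here.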

PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs

PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning

PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning

RSNN: Recurrent Spiking Neural Networks for Dynamic Spatial-Temporal Information Processing

Interpretable and Generalizable Spatiotemporal Predictive Learning with Disentangled Consistency

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

PredCNN: Predictive Learning with Cascade Convolutions

Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations

Revisiting the Temporal Modeling in Spatio-Temporal Predictive Learning under A Unified View

Z-Order Recurrent Neural Networks For Video Prediction

MS-RNN: A Flexible Multi-Scale Framework for Spatiotemporal Predictive Learning

Triplet Attention Transformer for Spatiotemporal Predictive Learning

PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

MoDeRNN: Towards Fine-grained Motion Details for Spatiotemporal Predictive Learning

PSRUNet: A Recurrent Neural Network for Spatiotemporal Sequence Forecasting Based on Parallel Simple Recurrent Unit

Enhancing Spatiotemporal Predictive Learning: An Approach with Nested Attention Module

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

SimVP: Towards Simple yet Powerful Spatiotemporal Predictive Learning

VMRNN: Integrating Vision Mamba and LSTM for Efficient and Accurate Spatiotemporal Forecasting

MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions

ModeRNN: Harnessing Spatiotemporal Mode Collapse in Unsupervised Predictive Learning