Abstract: Video prediction is a challenging problem in video representation learning because of the complexity of spatial structure and temporal variation. Existing methods mainly predict videos with language-based memory structures taken from traditional Long Short-Term Memories (LSTMs) or Gated Recurrent Units (GRUs), which may not be powerful enough to model the long-term dependencies in videos, whose spatiotemporal dynamics are far more complex than those of sentences. In this paper, we propose a SpatioTemporal Attention based Memory (STAM), which efficiently improves long-term spatiotemporal memorizing capacity by incorporating the global spatiotemporal information in videos. In the temporal domain, STAM observes temporal states from a wider temporal receptive field to capture accurate global motion information. In the spatial domain, STAM jointly uses the high-level semantic spatial state and the low-level texture spatial states to model a more reliable global spatial representation for videos. In particular, the global spatiotemporal information is extracted with the help of an Efficient SpatioTemporal Attention Gate (ESTAG), which adaptively applies different attention scores to different spatiotemporal states according to their importance. Moreover, the proposed STAM is built with 3D convolutional layers because of their advantages in modeling spatiotemporal dynamics for videos. Experimental results show that the proposed STAM achieves state-of-the-art performance on widely used datasets by leveraging the proposed spatiotemporal representations for videos.

Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning

Interpretable and Generalizable Spatiotemporal Predictive Learning with Disentangled Consistency

Triplet Attention Transformer for Spatiotemporal Predictive Learning

Revisiting the Temporal Modeling in Spatio-Temporal Predictive Learning under A Unified View

Enhancing spatiotemporal predictive learning: an approach with nested attention module

STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond

Wavelet-Driven Spatiotemporal Predictive Learning: Bridging Frequency and Time Variations

SimVP: Towards Simple yet Powerful Spatiotemporal Predictive Learning

Attention-Based Deep Spiking Neural Networks for Temporal Credit Assignment Problems

Self-Attention ConvLSTM for Spatiotemporal Prediction

HSTA: A Hierarchical Spatio-Temporal Attention Model for Trajectory Prediction

Temporal pattern attention for multivariate time series forecasting

PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs

Spatiotemporal Attention for Multivariate Time Series Prediction and Interpretation

STCA: Spatio-Temporal Credit Assignment with Delayed Feedback in Deep Spiking Neural Networks

STAM: A SpatioTemporal Attention Based Memory for Video Prediction

Enhancing Spatiotemporal Prediction Model using Modular Design and Beyond

A Spatial–Temporal Attention Approach for Traffic Prediction

Spatial and Temporal Visual Attention Prediction in Videos Using Eye Movement Data

Spatiotemporal Data Prediction Model Based on a Multi-Layer Attention Mechanism

Is Single Enough? A Joint Spatiotemporal Feature Learning Framework for Multivariate Time Series Prediction