Abstract:Video prediction presents a formidable challenge, requiring effectively processing spatial and temporal information embedded in videos. While recurrent neural network (RNN) and transformer-based models have been extensively explored to address spatial changes over time, recent advancements in convolutional neural networks (CNNs) have yielded high-performance video prediction models. CNN-based models offer advantages over RNN and transformer-based models due to their ease of parallel processing and lower computational complexity, highlighting their significance in practical applications. However, existing CNN-based video prediction models typically treat the spatiotemporal channels of videos similarly to the channel axis of static images. They stack frames in temporal order to construct a spatiotemporal axis and employ standard convolution operations. Nevertheless, this approach has its limitations. Applying convolution directly to the spatiotemporal axis results in a mixture of temporal and spatial information, which may lead to computational inefficiencies and reduced accuracy. Additionally, this operation needs to improve in processing temporal data. This study introduces a CNN-based time series decomposition model for video prediction. The proposed model first divides the convolution operation within the channel aggregation module to independently process the temporal and spatial dimensions. To capture evolving features, the temporal axis is segregated into trend and residual components, followed by applying a time series decomposition forecasting method. To assess the performance of the proposed technique, experiments were conducted using the moving MNIST, KTH, and KITTI-Caltech benchmark datasets. In the experiments on moving MNIST, despite a reduction of approximately 55% in the number of parameters and 37% in computational cost, the proposed method improved accuracy by up to 7% compared to the previous approach.

PredCNN: Predictive Learning with Cascade Convolutions

Deep Neural Network Acceleration with Sparse Prediction Layers

Frame Prediction Using Recurrent Convolutional Encoder with Residual Learning

PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning

PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs

SimVP: Simpler Yet Better Video Prediction

PredRNN++: Towards A Resolution of the Deep-in-Time Dilemma in Spatiotemporal Predictive Learning

Adaptive Hierarchical Motion-Focused Model for Video Prediction.

Predictive Coding Based Multiscale Network with Encoder-Decoder LSTM for Video Prediction

CNN-Based Time Series Decomposition Model for Video Prediction

A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction

Folded Recurrent Neural Networks for Future Video Prediction

From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Algebraic Representations for Faster Predictions in Convolutional Neural Networks

Video prediction: a step-by-step improvement of a video synthesis network

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

Z-Order Recurrent Neural Networks For Video Prediction

PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

Predicting Future Instance Segmentation by Forecasting Convolutional Features

Efficient Continuous Video Flow Model for Video Prediction

Video Frame Prediction by Deep Multi-Branch Mask Network