Abstract: Video prediction is a formidable challenge because it requires effective processing of both the spatial and the temporal information embedded in videos. While recurrent neural network (RNN) and transformer-based models have been extensively explored to address spatial changes over time, recent advances in convolutional neural networks (CNNs) have yielded high-performance video prediction models. CNN-based models offer advantages over RNN and transformer-based models because they are easier to parallelize and have lower computational complexity, which makes them significant for practical applications. However, existing CNN-based video prediction models typically treat the spatiotemporal channels of videos like the channel axis of static images: they stack frames in temporal order to construct a spatiotemporal axis and apply standard convolution operations to it. This approach has limitations. Applying convolution directly to the spatiotemporal axis mixes temporal and spatial information, which may lead to computational inefficiency and reduced accuracy. In addition, standard convolution leaves room for improvement in how it processes temporal data. This study introduces a CNN-based time series decomposition model for video prediction. The proposed model first divides the convolution operation within the channel aggregation module so that the temporal and spatial dimensions are processed independently. To capture evolving features, the temporal axis is then separated into trend and residual components, and a time series decomposition forecasting method is applied. To assess the performance of the proposed technique, experiments were conducted on the Moving MNIST, KTH, and KITTI-Caltech benchmark datasets. In the experiments on Moving MNIST, the proposed method improved accuracy by up to 7% compared to the previous approach, despite reductions of approximately 55% in the number of parameters and 37% in computational cost.
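The abstract does not say how the temporal axis is split into trend and residual components. As a minimal sketch of the general idea only, the NumPy example below assumes a moving-average trend along the time axis, which is a common choice in time series decomposition but is not confirmed as the authors' method. The function name `decompose_trend_residual`, the `kernel_size` parameter, and the `(T, H, W)` clip layout are hypothetical and introduced here for illustration.

```python
import numpy as np


def decompose_trend_residual(frames: np.ndarray, kernel_size: int = 3):
    """Split a (T, H, W) clip into trend and residual along the time axis.

    Assumption (not from the paper): the trend is a centered moving average
    over time with edge-replicated padding; the residual is the remainder.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    pad = kernel_size // 2
    # Replicate the first and last frames so the output keeps T time steps.
    padded = np.concatenate(
        [
            np.repeat(frames[:1], pad, axis=0),
            frames,
            np.repeat(frames[-1:], pad, axis=0),
        ],
        axis=0,
    )
    # Average each window of kernel_size consecutive frames.
    trend = np.stack(
        [padded[t:t + kernel_size].mean(axis=0) for t in range(frames.shape[0])],
        axis=0,
    )
    residual = frames - trend
    return trend, residual


# Hypothetical input: 10 random grayscale frames of size 64x64.
clip = np.random.default_rng(0).random((10, 64, 64))
trend, residual = decompose_trend_residual(clip, kernel_size=5)
```

Under this assumption the two components sum back to the input clip, so a model could process or forecast the slowly varying trend and the fast-changing residual separately and add the results.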

Hierarchical Grid Model for Video Prediction

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

A Hybrid Transformer-LSTM Model With 3D Separable Convolution for Video Prediction

Deep Hierarchical Video Compression

State-space Decomposition Model for Video Prediction Considering Long-term Motion Trend

CNN-Based Time Series Decomposition Model for Video Prediction

Progressive Multi-granularity Analysis for Video Prediction

Deep Hierarchical Representation of Point Cloud Videos via Spatio-Temporal Decomposition

Motion-Aware Feature Enhancement Network for Video Prediction

MMVP: Motion-Matrix-based Video Prediction

A lightweight multi-granularity asymmetric motion mode video frame prediction algorithm

Hierarchical Video Frame Sequence Representation with Deep Convolutional Graph Network

Learning Hierarchical Video Representation for Action Recognition

MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions

Learning Hierarchical Embedding for Video Instance Segmentation

Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes

Predicting Long-horizon Futures by Conditioning on Geometry and Time

Predicting Diverse Future Frames with Local Transformation-Guided Masking

Motion Graph Unleashed: A Novel Approach to Video Prediction

Efficient Continuous Video Flow Model for Video Prediction