Abstract: Video prediction is a formidable challenge because it requires effective processing of the spatial and temporal information embedded in videos. While recurrent neural network (RNN) and transformer-based models have been extensively explored to capture spatial changes over time, recent advances in convolutional neural networks (CNNs) have yielded high-performance video prediction models. CNN-based models are easier to parallelize and have lower computational complexity than RNN and transformer-based models, which makes them attractive for practical applications. However, existing CNN-based video prediction models typically treat the spatiotemporal channels of a video like the channel axis of a static image: they stack frames in temporal order to construct a spatiotemporal axis and apply standard convolution operations. This approach has limitations. Applying convolution directly to the spatiotemporal axis mixes temporal and spatial information, which may lead to computational inefficiency and reduced accuracy. In addition, standard convolution leaves room for improvement in processing temporal data. This study introduces a CNN-based time series decomposition model for video prediction. The proposed model first divides the convolution operation within the channel aggregation module so that the temporal and spatial dimensions are processed independently. To capture evolving features, the temporal axis is then separated into trend and residual components, and a time series decomposition forecasting method is applied. To assess the proposed technique, experiments were conducted on the Moving MNIST, KTH, and KITTI-Caltech benchmark datasets. On Moving MNIST, the proposed method improved accuracy by up to 7% compared to the previous approach, despite reductions of approximately 55% in the number of parameters and 37% in computational cost.
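The trend/residual split of the temporal axis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the decomposition is computed, so the code below assumes a centered moving average over time as the trend, a common choice in time series decomposition forecasting. The function name `decompose_time_axis`, the `kernel_size` parameter, and the edge-replication padding are all hypothetical and are not taken from the paper.

```python
import numpy as np


def decompose_time_axis(frames: np.ndarray, kernel_size: int = 3):
    """Split a (T, H, W) clip into trend and residual along the time axis.

    Assumption: the trend is a centered moving average over time; the
    paper's actual decomposition may differ.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    num_frames = frames.shape[0]
    pad = kernel_size // 2
    # Replicate the first and last frame so the trend keeps length T.
    padded = np.pad(frames, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    # Stack the kernel_size time-shifted views and average them per pixel.
    windows = np.stack(
        [padded[i:i + num_frames] for i in range(kernel_size)], axis=0
    )
    trend = windows.mean(axis=0)
    # The residual is whatever the smooth trend does not explain.
    residual = frames - trend
    return trend, residual


# Example: a random 10-frame clip of 64x64 images.
clip = np.random.default_rng(0).random((10, 64, 64))
trend, residual = decompose_time_axis(clip, kernel_size=3)
```

By construction, `trend + residual` reproduces the input clip, so a forecaster could in principle predict the two components separately and sum the results. This shows only the general idea, not the authors' implementation.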

Space-Time Separate Modeling for Efficient Video Classification

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

TransVOS: Video Object Segmentation with Transformers

Spatio-Temporal Collaborative Module for Efficient Action Recognition

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Dynamic information enhancement for video classification

Dynamic Spatio-Temporal Feature Learning via Graph Convolution in 3D Convolutional Networks

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Space-time video super-resolution using long-term temporal feature aggregation

Two-stream Collaborative Learning with Spatial-Temporal Attention for Video Classification

Design Light-weight 3D Convolutional Networks for Video Recognition: Temporal Residual, Fully Separable Block, and Fast Algorithm

Temporal Modulation Network for Controllable Space-Time Video Super-Resolution

Space or time for video classification transformers

Space-time video super-resolution via multi-scale feature interpolation and temporal feature fusion

Space-Time Video Super-resolution with Neural Operator

Intelligent 3D Network Protocol for Multimedia Data Classification using Deep Learning

Learning Spatiotemporal Interactions for User-Generated Video Quality Assessment

CNN-Based Time Series Decomposition Model for Video Prediction

Modelling a Spatial-Motion Deep Learning Framework to Classify Dynamic Patterns of Videos

Maximizing Spatio-Temporal Entropy of Deep 3D CNNs for Efficient Video Recognition