Abstract: Video prediction is the challenging task of generating the future frames of a video given a sequence of previously observed frames. This task involves the construction of an internal representation that accurately models the frame evolutions, including contents and dynamics. Video prediction is considered difficult due to the inherent compounding of errors in recursive pixel-level prediction. In this paper, we present a novel video prediction system that focuses on regions of interest (ROIs) rather than on entire frames and learns frame evolutions at the transformation level rather than at the pixel level. We provide two strategies to generate high-quality ROIs that contain potential moving visual cues. The frame evolutions are modeled with a transformation generator that produces transformers and masks simultaneously, which are then combined to generate the future frame in a transformation-guided masking procedure. Compared with recent approaches, our system is able to generate more accurate predictions by modeling the visual evolutions at the transformation level rather than at the pixel level. Focusing on ROIs avoids a heavy computational burden and enables our system to generate high-quality long-term future frames without severely amplified signal loss. Moreover, our system is able to generate diverse plausible future frames, which is important in many real-world scenarios. Furthermore, we enable our system to perform video prediction conditioned on a single frame by revising the transformation generator to produce motion-centric transformers. We test our system on four datasets with different experimental settings and demonstrate its advantages over recent methods, both quantitatively and qualitatively.
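
The masking idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned transformation generator is replaced here by a fixed integer translation, and all names (`crop`, `shift`, `blend`, `predict_frame`, the `(top, left, height, width)` ROI layout) are hypothetical choices made for illustration. The sketch only shows how a transformed ROI and a mask might be combined and pasted back into the frame.

```python
from typing import List, Tuple

Frame = List[List[float]]          # grayscale frame as rows of pixel values
ROI = Tuple[int, int, int, int]    # (top, left, height, width), an assumed layout


def crop(frame: Frame, roi: ROI) -> Frame:
    """Cut the region of interest out of the frame."""
    top, left, h, w = roi
    return [row[left:left + w] for row in frame[top:top + h]]


def shift(patch: Frame, dy: int, dx: int) -> Frame:
    """Translate the patch by (dy, dx), replicating edge pixels.

    Stand-in for a learned transformation; a real system would predict it.
    """
    h, w = len(patch), len(patch[0])
    return [
        [patch[min(max(y - dy, 0), h - 1)][min(max(x - dx, 0), w - 1)]
         for x in range(w)]
        for y in range(h)
    ]


def blend(transformed: Frame, previous: Frame, mask: Frame) -> Frame:
    """Per-pixel mix: mask * transformed + (1 - mask) * previous."""
    return [
        [m * t + (1.0 - m) * p for t, p, m in zip(t_row, p_row, m_row)]
        for t_row, p_row, m_row in zip(transformed, previous, mask)
    ]


def predict_frame(frame: Frame, roi: ROI, motion: Tuple[int, int],
                  mask: Frame) -> Frame:
    """Transform only the ROI, blend it with the mask, paste it back."""
    top, left, h, w = roi
    patch = crop(frame, roi)
    moved = shift(patch, motion[0], motion[1])
    new_patch = blend(moved, patch, mask)
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for y in range(h):
        out[top + y][left:left + w] = new_patch[y]
    return out


# Usage: move a single bright pixel one step to the right inside a 3x3 ROI.
frame = [[0.0] * 4 for _ in range(4)]
frame[1][1] = 1.0
full_mask = [[1.0] * 3 for _ in range(3)]
next_frame = predict_frame(frame, (0, 0, 3, 3), (0, 1), full_mask)
```

Pixels outside the ROI are copied unchanged, which reflects the abstract's point that restricting work to ROIs keeps the computation small; a mask of zeros would leave the ROI unchanged as well.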

Motion Selective Prediction for Video Frame Synthesis

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos

Motion and Context-Aware Audio-Visual Conditioned Video Prediction

MMVP: Motion-Matrix-based Video Prediction

Adaptive Recurrent Frame Prediction with Learnable Motion Vectors

Motion Graph Unleashed: A Novel Approach to Video Prediction

Flexible Spatio-Temporal Networks for Video Prediction

Deep Learned Frame Prediction for Video Compression

A lightweight multi-granularity asymmetric motion mode video frame prediction algorithm

Dual Motion GAN for Future-Flow Embedded Video Prediction

Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes

Predicting Diverse Future Frames with Local Transformation-Guided Masking

Video Frame Prediction by Deep Multi-Branch Mask Network

Video prediction: a step-by-step improvement of a video synthesis network

Predicting Long-horizon Futures by Conditioning on Geometry and Time

Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction

Video Frame Prediction from a Single Image and Events

Disentangling Propagation and Generation for Video Prediction

Decomposing Motion and Content for Natural Video Sequence Prediction

Text-driven Video Prediction