Predicting Diverse Future Frames with Local Transformation-Guided Masking.

Jinzhuo Wang, Wenmin Wang, Wen Gao
DOI: https://doi.org/10.1109/tcsvt.2018.2882061
IF: 5.859
2019-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Video prediction is the challenging task of generating the future frames of a video given a sequence of previously observed frames. This task involves the construction of an internal representation that accurately models the frame evolutions, including contents and dynamics. Video prediction is considered difficult due to the inherent compounding of errors in recursive pixel-level prediction. In this paper, we present a novel video prediction system that focuses on regions of interest (ROIs) rather than on entire frames and learns frame evolutions at the transformation level rather than at the pixel level. We provide two strategies to generate high-quality ROIs that contain potential moving visual cues. The frame evolutions are modeled with a transformation generator that produces transformers and masks simultaneously, which are then combined to generate the future frame in a transformation-guided masking procedure. Compared with recent approaches, our system is able to generate more accurate predictions by modeling the visual evolutions at the transformation level rather than at the pixel level. Focusing on ROIs avoids a heavy computational burden and enables our system to generate high-quality long-term future frames without severely amplified signal loss. Moreover, our system is able to generate diverse plausible future frames, which is important in many real-world scenarios. Furthermore, we enable our system to perform video prediction conditioned on a single frame by revising the transformation generator to produce motion-centric transformers. We test our system on four datasets with different experimental settings and demonstrate its advantages over recent methods, both quantitatively and qualitatively.
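The abstract does not specify the form of the transformers or masks, so the following is only an illustrative sketch of how transformed copies of a frame might be blended by per-pixel masks. It assumes, purely for illustration, that each transformation is an integer pixel translation and that the masks are softmax-normalized across candidates; the function names (`shift_frame`, `softmax_masks`, `compose`) are hypothetical and are not taken from the paper.

```python
import math

def shift_frame(frame, dx, dy, fill=0.0):
    """Translate a 2D frame (list of rows) by (dx, dy) pixels.

    Assumption: a translation stands in for whatever transformation
    the paper's generator actually produces.
    """
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = frame[sy][sx]
    return out

def softmax_masks(logits):
    """Normalize K mask-logit maps so the weights at each pixel sum to 1."""
    k, h, w = len(logits), len(logits[0]), len(logits[0][0])
    masks = [[[0.0] * w for _ in range(h)] for _ in range(k)]
    for y in range(h):
        for x in range(w):
            peak = max(logits[i][y][x] for i in range(k))
            exps = [math.exp(logits[i][y][x] - peak) for i in range(k)]
            total = sum(exps)
            for i in range(k):
                masks[i][y][x] = exps[i] / total
    return masks

def compose(frame, shifts, mask_logits):
    """Blend K transformed copies of `frame` using per-pixel masks."""
    candidates = [shift_frame(frame, dx, dy) for dx, dy in shifts]
    masks = softmax_masks(mask_logits)
    h, w = len(frame), len(frame[0])
    return [
        [sum(masks[i][y][x] * candidates[i][y][x] for i in range(len(shifts)))
         for x in range(w)]
        for y in range(h)
    ]

# Toy example: one bright pixel, two candidates (stay in place, move right by 1).
frame = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
shifts = [(0, 0), (1, 0)]
equal_logits = [[[0.0] * 3 for _ in range(2)] for _ in range(2)]
blended = compose(frame, shifts, equal_logits)
```

With equal mask logits, each candidate contributes half of every pixel, so the bright pixel is split between its original position and the shifted one. Pushing the logits strongly toward one candidate selects that transformation almost exclusively, which is one simple way to picture how different masks could yield different plausible futures.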