Hierarchical Grid Model for Video Prediction

Qinyu Li, Siyuan Wu, Hanli Wang
DOI: https://doi.org/10.1142/9789811223334_0097
2020-01-01
Abstract: Video prediction has recently drawn growing attention for its application potential. Long-term prediction is challenging to model, however, because dense pixels must be predicted along both the spatial and temporal dimensions. Several recent approaches to long-term video prediction treat pixel transformation as a global process between adjacent frames, whereas the position and motion of pixels in real videos are organized hierarchically. Inspired by this, a novel hierarchical prediction model is proposed in this work to decompose the complex, composite motions of real videos into simple ones based on their locations. This reduces learning difficulty and allows the model to fit a variety of movements. In addition, high-resolution videos, which are harder to model, are also investigated, since they involve larger ranges of movement and many more details. The proposed model builds upon a spatial transformer predictor to realize a hierarchical structure that learns motions from videos. Experimental results on the real-world benchmark video dataset Human3.6M demonstrate the effectiveness of the proposed model compared with baseline approaches.
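The idea of decomposing motion by location can be illustrated with a toy sketch. The code below is not the model from the paper: it assumes a grid that doubles in resolution at each level, replaces the learned spatial transformer with fixed integer translations per cell, and fills pixels with no in-cell source with zero. The names grid_cells, shift_cell, and predict are hypothetical and chosen only for this illustration.

```python
def grid_cells(height, width, level):
    """Split an H x W frame into 2**level x 2**level cells.

    Returns (top, left, bottom, right) boxes in row-major order.
    Assumption: the grid doubles per level; the paper may partition differently.
    """
    n = 2 ** level
    cells = []
    for i in range(n):
        for j in range(n):
            top, bottom = i * height // n, (i + 1) * height // n
            left, right = j * width // n, (j + 1) * width // n
            cells.append((top, left, bottom, right))
    return cells


def shift_cell(frame, box, dy, dx):
    """Translate the pixels inside one cell by (dy, dx).

    A stand-in for a learned per-cell transform. Pixels whose source
    would lie outside the cell are set to 0 (an arbitrary choice).
    """
    top, left, bottom, right = box
    out = [row[:] for row in frame]  # copy so the input frame is untouched
    for y in range(top, bottom):
        for x in range(left, right):
            sy, sx = y - dy, x - dx
            inside = top <= sy < bottom and left <= sx < right
            out[y][x] = frame[sy][sx] if inside else 0
    return out


def predict(frame, motions):
    """Apply per-cell motions level by level, from coarse to fine.

    motions[level] holds one (dy, dx) pair per cell of that level,
    in the same row-major order that grid_cells returns.
    """
    height, width = len(frame), len(frame[0])
    for level, cell_motions in enumerate(motions):
        boxes = grid_cells(height, width, level)
        for box, (dy, dx) in zip(boxes, cell_motions):
            frame = shift_cell(frame, box, dy, dx)
    return frame


# Usage: one global shift at level 0, then local shifts in two of the four level-1 cells.
frame = [[0] * 4 for _ in range(4)]
frame[0][0], frame[2][2] = 1, 5
motions = [
    [(0, 1)],                            # level 0: whole frame moves right by one pixel
    [(1, 0), (0, 0), (0, 0), (0, -1)],   # level 1: per-quadrant motion
]
next_frame = predict(frame, motions)
```

The coarse level captures movement shared by the whole frame, and the finer level only has to account for what differs within each region, which is the intuition behind splitting a composite motion into simpler local ones.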