Video Frame Prediction with Dual-Stream Deep Network Emphasizing Motions and Content Details.

Qingming Huang, Zhongxiao Li, Liying Zheng, Tianyi Yang
DOI: https://doi.org/10.1016/j.asoc.2022.109170
IF: 8.7
2022-01-01
Applied Soft Computing
Abstract: Video frame prediction is both challenging and critical for computer vision. Although research on predicting video frames has gradually shifted from pixel-law based methods to motion-based ones, existing predictors often generate ambiguous future frames, especially for long-term predictions. This paper proposes a composed model to generate future frames with more detail. First, to further exploit motion information, we design a single motion decoder to strengthen the efficiency of the motion encoder in the original motion-content network (MCnet). Second, to alleviate prediction ambiguity, we use edges both with and without semantic meaning, taken from the holistically-nested edge detection (HED) module, as content details. Third, based on the conclusion that the mean squared error (MSE) loss and the traditional generative adversarial learning framework cause the unsatisfactory predictions of MCnet, we design a composite loss function that guides our model to focus simultaneously on motions and content details. Also based on this conclusion, we finally embed our model in an improved generative adversarial network, which further enhances its performance. Experimental results on the benchmark KTH and UCF101 datasets show that our model outperforms state-of-the-art predictors, such as the basic MCnet, the predictive neural network (PredNet), and PredNet with a reduced-gate convolutional network (rgc-PredNet), in terms of peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM), especially for long-term video frame prediction.
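The abstract reports results in terms of PSNR. As a reminder of what that metric measures, here is a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), written in plain Python. The function name `psnr`, the flat-list frame representation, and the default peak value of 255 are assumptions made for illustration; the abstract does not describe the paper's actual evaluation code or pixel range.

```python
import math
from typing import Sequence


def psnr(predicted: Sequence[float], target: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between two frames.

    Frames are given as flat sequences of pixel intensities (an
    assumption for this sketch). max_value is the largest possible
    pixel value, e.g. 255 for 8-bit images.
    """
    if len(predicted) != len(target) or len(target) == 0:
        raise ValueError("frames must be non-empty and of equal length")

    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

    # Identical frames have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value means the predicted frame is closer to the ground-truth frame in the pixel-wise sense. For a predicted sequence, the metric would typically be computed per frame and then averaged, though the exact protocol used in the paper is not stated in the abstract.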