Video Reconstruction with Multimodal Information

Zhipeng Xie, Yiping Duan, Qiyuan Du, Xiaoming Tao, Jiazhong Yu
DOI: https://doi.org/10.1109/vtc2023-fall60731.2023.10333705
2023-01-01
Abstract: Video reconstruction refers to generating videos from high-level representations (edge maps, labels, and so on). Reconstruction quality is often unsatisfactory because these representations are sparse, especially for video data. To improve video reconstruction quality, we propose a novel approach that generates realistic video from multimodal information comprising structure features and color features. To extract color features, we mainly apply the k-means algorithm to segment labels; the structure features are extracted by an edge detection network. Video generation is regarded as learning the mapping from multimodal representations to the original videos, so a conditional GAN is applied with a learning objective that models the temporal video dynamics. We use a spatio-temporal generator with attention to model inter-frame dynamics, which improves video consistency. Moreover, we use a multiscale discriminator to improve the intra-frame quality of the video. Experimental results on the Cityscapes and Apolloscape datasets demonstrate that our proposed approach performs better on both traditional and generative evaluation metrics.
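The abstract does not say how k-means is applied to obtain color features, so the following is only a minimal sketch of one plausible reading: clustering the pixel colors of a single frame into k groups to produce a coarse label map and a color palette. The function name `kmeans_color_labels` and the parameters `k`, `iters`, and `seed` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def kmeans_color_labels(frame, k=4, iters=10, seed=0):
    """Cluster the pixels of an H x W x C frame into k color groups.

    Illustrative only; not the authors' implementation.
    Returns (labels, palette): an H x W integer label map and a k x C array
    of cluster-center colors.
    """
    h, w, c = frame.shape
    pixels = frame.reshape(-1, c).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centers from k randomly chosen pixel positions.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]

    def assign(points, ctrs):
        # Nearest center by squared Euclidean distance in color space.
        dists = ((points[:, None, :] - ctrs[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(iters):
        labels = assign(pixels, centers)
        # Move each center to the mean of its pixels; leave empty clusters as-is.
        for j in range(k):
            members = pixels[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    # Final assignment against the updated centers.
    labels = assign(pixels, centers)
    return labels.reshape(h, w), centers
```

Under this reading, the label map and palette would be one of the conditioning inputs to the generator, alongside the edge map produced by the edge detection network.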