Scripted Video Generation With a Bottom-Up Generative Adversarial Network

A. van den Hengel, Qingyao Wu, Qi Wu, Qi Chen, Mingkui Tan, Jian Chen
DOI: https://doi.org/10.1109/TIP.2020.3003227
IF: 10.6
2020-07-10
IEEE Transactions on Image Processing
Abstract: Generating videos given a text description (such as a script) is non-trivial due to the intrinsic complexity of image frames and the structure of videos. Although Generative Adversarial Networks (GANs) have been successfully applied to generate images conditioned on a natural language description, it is still very challenging to generate realistic videos in which the frames must be both spatially and temporally coherent. In this paper, we propose a novel Bottom-up GAN (BoGAN) method for generating videos given a text description. To ensure the coherence of the generated frames and also make the whole video match the language description semantically, we design a bottom-up optimisation mechanism to train BoGAN. Specifically, we devise a region-level loss via an attention mechanism to preserve the local semantic alignment and draw details in different sub-regions of the video conditioned on the words that are most relevant to them. Moreover, to guarantee the matching between text and frame, we introduce a frame-level discriminator, which can also maintain the fidelity of each frame and the coherence across frames. Finally, to ensure the global semantic alignment between the whole video and the given text, we apply a video-level discriminator. We evaluate the effectiveness of the proposed BoGAN on two synthetic datasets (i.e., SBMG and TBMG) and two real-world datasets (i.e., MSVD and KTH).
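The region-level idea in the abstract, where each sub-region is conditioned on the words most relevant to it, can be illustrated with generic dot-product attention. The following is a minimal sketch under stated assumptions, not BoGAN's actual implementation: the function name `attend_words`, the dot-product relevance score, and the toy two-dimensional vectors are all hypothetical and do not come from the paper.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_words(region, words):
    """Attend over word vectors for one region feature (illustrative only).

    Assumption: relevance is a plain dot product between the region
    feature and each word embedding; the paper may use a different score.
    Returns the attention weights and the weighted word-context vector.
    """
    weights = softmax([dot(region, w) for w in words])
    dim = len(words[0])
    context = [sum(wt * w[d] for wt, w in zip(weights, words))
               for d in range(dim)]
    return weights, context

# Toy inputs: two word embeddings and two region features (made up).
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 2.0]]

for region in regions:
    weights, context = attend_words(region, words)
    print(weights, context)
```

In this toy setup each region places most of its attention on the word whose embedding points in the same direction, which is the behaviour a region-level alignment loss would encourage.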
Computer Science