From Object-Attribute-Relation Semantic Representation to Video Generation: A Multiple Variational Autoencoder Approach

Yiping Duan, Mingzhe Li, Lijia Wen, Qianqian Yang, Xiaoming Tao
DOI: https://doi.org/10.1109/mlsp55214.2022.9943394
2022-01-01
Abstract: Video generation refers to synthesizing high-resolution video from latent representations or features. In an end-to-end encoder-decoder generation framework, the intermediate latent representation is expected to contain the important semantic information within a small amount of structural data, so that the generated videos have high fidelity and good perceptual quality. With these considerations, we propose a multiple variational autoencoder approach for video generation with an object-attribute-relation (OAR) model. The proposed framework generates a video by decoding semantic latent representations in an OAR pattern (objects, attributes, and their relations) into plausible high-fidelity videos. Specifically, the videos are first represented in terms of a well-organized, easily parsed OAR structure and the remaining background. We use multiple encoders to learn the latent embeddings of objects, attributes, relations, and the remaining background separately, treating them as different semantic components. Correspondingly, multiple decoders reconstruct these components, which are then fused by a UNet to generate the full videos. We improve the video generation quality by introducing the relations between the objects. Experimental results on the challenging Google Research Football dataset, along with a detailed comparison to advanced methods, verify the effectiveness of the proposed framework.
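The data flow described in the abstract (one encoder and one decoder per semantic component, followed by fusion) can be illustrated with a minimal sketch. Everything below is an assumption for illustration and not the authors' implementation: the names `ComponentVAE`, `generate`, and `kl_to_standard_normal` are hypothetical, the "networks" are single random linear maps rather than trained models, the inputs are random feature vectors rather than video frames, and the fusion step is an element-wise mean standing in for the UNet.

```python
import math
import random

# The four semantic components named in the abstract.
COMPONENTS = ("objects", "attributes", "relations", "background")


def make_linear(n_in, n_out, rng):
    """Random weight matrix standing in for a learned layer (hypothetical)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def apply_linear(weights, x):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ComponentVAE:
    """One encoder/decoder pair for a single semantic component (toy version)."""

    def __init__(self, n_in, n_latent, rng):
        self.rng = rng
        self.w_mu = make_linear(n_in, n_latent, rng)
        self.w_logvar = make_linear(n_in, n_latent, rng)
        self.w_dec = make_linear(n_latent, n_in, rng)

    def encode(self, x):
        # Returns the mean and log-variance of the approximate posterior.
        return apply_linear(self.w_mu, x), apply_linear(self.w_logvar, x)

    def sample(self, mu, logvar):
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
        return [m + math.exp(0.5 * lv) * self.rng.gauss(0.0, 1.0)
                for m, lv in zip(mu, logvar)]

    def decode(self, z):
        return apply_linear(self.w_dec, z)


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def generate(features, vaes):
    """Encode and decode each component separately, then fuse the outputs.

    Fusion is an element-wise mean here, a placeholder for the UNet.
    """
    recon, kl = {}, 0.0
    for name in COMPONENTS:
        mu, logvar = vaes[name].encode(features[name])
        kl += kl_to_standard_normal(mu, logvar)
        recon[name] = vaes[name].decode(vaes[name].sample(mu, logvar))
    columns = zip(*(recon[name] for name in COMPONENTS))
    fused = [sum(vals) / len(COMPONENTS) for vals in columns]
    return fused, kl


rng = random.Random(0)
vaes = {name: ComponentVAE(n_in=8, n_latent=3, rng=rng) for name in COMPONENTS}
features = {name: [rng.random() for _ in range(8)] for name in COMPONENTS}
frame, kl = generate(features, vaes)
```

Keeping a separate latent per component is what lets each semantic part be encoded compactly and decoded independently before fusion; in a trained system the summed KL term would be combined with a reconstruction loss, which this sketch omits.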