Abstract: With the development of computer science and deep learning networks, AI generation technology is becoming increasingly mature. Video has become one of the most important information carriers in daily life because of the large amount of data and information it carries. However, because of this large amount of information and its complex semantics, video generation, especially of high-definition (HD) video, has been a difficult problem in the field of deep learning. Video semantic representation and semantic reconstruction are both difficult tasks. Because video content is changeable and its information is highly correlated, we propose an HD video generation model driven by spatio-temporal scene graphs: the spatio-temporal scene graph to video (StSg2vid) model. First, we input a sequence of spatio-temporal scene graphs as the semantic representation of the information in each frame of the video. The scene graph that describes the semantic information of each frame contains the motion progress of the objects in the video at that moment, which acts as a clock. Next, a graph convolutional neural network propagates the relationship information between objects through the spatio-temporal scene graph and predicts the scene layout at that moment. Lastly, the image generation model predicts the frame image of the current moment. The frame at each moment depends on the scene layout at the current moment and on the frame and scene layout at the previous moment. We introduce a flow net, a warping prediction model, and the spatially-adaptive normalization (SPADE) network to generate the predicted image for each frame. We used the Action Genome dataset. Compared with current state-of-the-art algorithms, the videos generated by our model achieve better results in both quantitative indicators and user evaluations. In addition, we generalized the StSg2vid model to virtual reality (VR) videos of indoor scenes, preliminarily explored a generation method for VR videos, and achieved good results.
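
The per-frame data flow described in the abstract can be sketched roughly as follows. This is only an illustration of the stated dependencies (scene graph with a motion-progress value, then layout, then a frame conditioned on the previous frame and layout), not the authors' implementation. All names here (`FrameSceneGraph`, `predict_layout`, `generate_frame`, `generate_video`) are hypothetical, and the two model stages are replaced by trivial placeholders standing in for the graph convolutional network and for the flow, warping, and SPADE generator.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), normalized; assumed format


@dataclass
class FrameSceneGraph:
    """Scene graph for one frame (hypothetical structure)."""
    objects: List[str]
    triples: List[Tuple[int, str, int]]  # (subject index, predicate, object index)
    progress: float  # motion progress in [0, 1], playing the role of a clock


def predict_layout(graph: FrameSceneGraph) -> List[Box]:
    # Placeholder for the graph-convolution layout predictor:
    # one box per object, shifted horizontally by the progress value.
    n = len(graph.objects)
    return [((i + graph.progress) / (n + 1), 0.5, 0.2, 0.2) for i in range(n)]


def generate_frame(
    layout: List[Box],
    prev_frame: Optional[Dict],
    prev_layout: Optional[List[Box]],
) -> Dict:
    # Placeholder for the image generator. A real model would synthesize
    # pixels; here we only record which inputs the frame was conditioned on.
    return {
        "layout": layout,
        "uses_prev": prev_frame is not None and prev_layout is not None,
    }


def generate_video(graphs: List[FrameSceneGraph]) -> List[Dict]:
    """Generate frames in order; each frame sees the previous frame and layout."""
    frames: List[Dict] = []
    prev_frame: Optional[Dict] = None
    prev_layout: Optional[List[Box]] = None
    for graph in graphs:
        layout = predict_layout(graph)
        frame = generate_frame(layout, prev_frame, prev_layout)
        frames.append(frame)
        prev_frame, prev_layout = frame, layout
    return frames


if __name__ == "__main__":
    clip = [
        FrameSceneGraph(["person", "cup"], [(0, "holding", 1)], t / 2)
        for t in range(3)
    ]
    for index, frame in enumerate(generate_video(clip)):
        print(index, frame["uses_prev"], frame["layout"][0])
```

The first frame has no predecessor, so the sketch passes `None` for the previous frame and layout; how the actual model initializes the first frame is not stated in the abstract.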

Video Reconstruction with Multimodal Information

An Unsupervised Video Summarization Method Based on Multimodal Representation

Semantic Reconstruction based on RGB Image and Sparse Depth

VR+HD: Video Semantic Reconstruction from Spatio-Temporal Scene Graphs

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Reconstructing Rapid Natural Vision with fMRI-Conditional Video Generative Adversarial Network

Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity

X-GANs: Image Reconstruction Made Easy for Extreme Cases

Generative Adversarial Networks for Multimodal Representation Learning in Video Hyperlinking

Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction

Visual Data Synthesis Via GAN for Zero-Shot Video Classification

High-order relational generative adversarial network for video super-resolution

Towards High Resolution Video Generation with Progressive Growing of Sliced Wasserstein GANs

CMCGAN: A Uniform Framework for Cross-Modal Visual-Audio Mutual Generation

2D GANs Meet Unsupervised Single-view 3D Reconstruction

Video Content Swapping Using GAN

Semantics-Guided Hierarchical Feature Encoding Generative Adversarial Network for Visual Image Reconstruction From Brain Activity

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Deep Cross-View Reconstruction GAN Based on Correlated Subspace for Multi-View Transformation