Abstract: We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies, with attention to computational and dataset efficiency. To capture long spatio-temporal dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a single latent code to model an entire video clip. Individual video frames are then synthesized from an intermediate tri-plane representation, which itself is derived from the primary latent code. This novel strategy more than halves the computational complexity measured in FLOPs compared to the most efficient state-of-the-art methods. Consequently, our approach facilitates the efficient and temporally coherent generation of videos. Moreover, our joint frame modeling approach, in contrast to autoregressive methods, mitigates the generation of visual artifacts. We further enhance the model's capabilities by integrating an optical flow-based module within our Generative Adversarial Network (GAN) based generator architecture, thereby compensating for the constraints imposed by a smaller generator size. As a result, our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps. The efficacy and versatility of our approach are empirically validated through qualitative and quantitative assessments across three different datasets comprising both synthetic and real video clips. We will make our training and inference code public.

What problem does this paper attempt to address?

The paper addresses the efficient generation of high-quality videos with long-term spatial and temporal dependencies. Specifically, it tackles the following challenges:

1. **Long-term spatial and temporal dependencies**: Existing unconditional video generation methods often struggle to capture long-term spatial and temporal dependencies, so the generated videos lack temporal and spatial coherence.
2. **Computational efficiency**: Generating high-resolution videos requires substantial computational resources, and existing methods typically have high computational complexity, especially for long videos.
3. **Visual artifacts**: Autoregressive models tend to accumulate errors as they generate frames, which leads to visual artifacts.

To address these challenges, the paper proposes a video generation model called RAVEN (Rethinking Adversarial Video Generation with Efficient Tri-plane Networks), which relies on the following components:

- **Tri-plane representation**: Introduces a hybrid explicit-implicit tri-plane representation to capture long-term spatio-temporal dependencies. According to the abstract, this more than halves the computational complexity in FLOPs compared to the most efficient state-of-the-art methods.
- **Single latent code**: Uses a single latent code to model an entire video clip rather than generating it frame by frame, which reduces both computational complexity and error accumulation.
- **Optical flow module**: Integrates an optical flow-based module into the GAN generator to compensate for the constraints of a smaller generator and to improve the representation of motion.
- **Efficient generator design**: Designs a generator architecture that handles extended video sequences while remaining computationally efficient.

Through these components, RAVEN generates high-fidelity videos at a resolution of $256\times256$ pixels and at a lower computational cost, and it supports frame extrapolation and interpolation at test time.
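The tri-plane idea can be illustrated with a small sketch. This is not the authors' implementation: the plane names (`xy`, `xt`, `yt`), the helper functions, the plane resolution, and the choice to aggregate by summation are all assumptions made for illustration, loosely following how tri-planes are commonly used in 3D-aware generators. It shows how a feature for a space-time point $(x, y, t)$ might be read from three 2D feature planes instead of a dense 3D volume.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample an (H, W, C) feature plane at normalized coords u, v in [0, 1]."""
    h, w, _ = plane.shape
    # Map normalized coordinates to continuous pixel indices.
    x = u * (w - 1)
    y = v * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along the horizontal axis, then along the vertical axis.
    top = (1 - fx) * plane[y0, x0] + fx * plane[y0, x1]
    bottom = (1 - fx) * plane[y1, x0] + fx * plane[y1, x1]
    return (1 - fy) * top + fy * bottom


def query_triplane(planes, x, y, t):
    """Return a feature for the space-time point (x, y, t), each in [0, 1].

    Hypothetical aggregation: project the point onto each plane and sum
    the three sampled features (an assumption, not confirmed by the source).
    """
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xt = bilinear_sample(planes["xt"], x, t)
    f_yt = bilinear_sample(planes["yt"], y, t)
    return f_xy + f_xt + f_yt


# Illustrative sizes only; in a real model the planes would come from the generator.
rng = np.random.default_rng(0)
res, channels = 16, 8
planes = {name: rng.standard_normal((res, res, channels)) for name in ("xy", "xt", "yt")}

feature = query_triplane(planes, x=0.25, y=0.5, t=0.75)
print(feature.shape)  # (8,)

# Storage comparison: three planes versus a dense space-time feature volume.
triplane_values = 3 * res**2 * channels
dense_values = res**3 * channels
```

In this sketch the three planes store `3 * res**2 * channels` values, whereas a dense volume would need `res**3 * channels`, which hints at why a plane-based factorization can be cheaper. In a full generator, a decoder would then map the aggregated feature to pixel values; that step is omitted here.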

RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

3D-Aware Image Synthesis Via Learning Structural and Textural Representations

Autoencoding Video Latents for Adversarial Video Generation

3DAttGAN: A 3D Attention-based Generative Adversarial Network for Joint Space-Time Video Super-Resolution

TriPlaneNet: An Encoder for EG3D Inversion

Towards Smooth Video Composition

Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks

Diverse Video Generation from a Single Video

Towards High Resolution Video Generation with Progressive Growing of Sliced Wasserstein GANs

REDUCIO! Generating 1024$\times$1024 Video within 16 Seconds using Extremely Compressed Motion Latents

Scaling Autoregressive Video Models

Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars

Latent Video Diffusion Models for High-Fidelity Long Video Generation

Coherent3D: Coherent 3D Portrait Video Reconstruction via Triplane Fusion

Pyramidal Flow Matching for Efficient Video Generative Modeling

INR-V: A Continuous Representation Space for Video-based Generative Tasks

Video Probabilistic Diffusion Models in Projected Latent Space

Latent Neural Differential Equations for Video Generation