Abstract: Video generation has made remarkable progress in recent years, especially since the advent of video diffusion models. Many video generation models, e.g., Stable Video Diffusion (SVD), can produce plausible synthetic videos. However, most video models can only generate low frame rate videos because of limited GPU memory and the difficulty of modeling a large set of frames. The training videos are always uniformly sampled at a specified interval for temporal compression. Previous methods raise the frame rate either by training a video interpolation model in pixel space as a postprocessing stage or by training an interpolation model in latent space for a specific base video model. In this paper, we propose a training-free video interpolation method for generative video diffusion models, which generalizes to different models in a plug-and-play manner. We investigate the non-linearity in the feature space of video diffusion models and transform a video model into a self-cascaded video diffusion model by incorporating the designed hidden state correction modules. The self-cascaded architecture and the correction module are proposed to retain the temporal consistency between key frames and the interpolated frames. Extensive evaluations are performed on multiple popular video models to demonstrate the effectiveness of the proposed method; notably, our training-free method is even comparable to trained interpolation models supported by huge compute resources and large-scale datasets.

What problem does this paper attempt to address?

This paper focuses on raising the frame rate of existing video generation models, particularly those based on diffusion, so that they produce smoother videos. Existing video models often generate low frame rate videos because of GPU memory constraints and the difficulty of modeling long frame sequences. The paper proposes a video interpolation method called "ZeroSmooth" that requires no additional training data and no parameter updates, and that can be applied to different video models in a plug-and-play manner.

The core of the method is to transform the video model into a self-cascaded structure with two branches: one runs inference on the short video (the key frames), and the other adapts the same model to inference on the long video (key frames plus interpolated frames). A hidden state correction module uses the hidden states from both branches to calibrate the hidden states of the long branch, which strengthens content control and keeps the interpolated frames temporally consistent with the key frames. The paper also designs a strategy to control the strength of this correction.

Experiments show that ZeroSmooth is effective on multiple popular video models and is even comparable to trained interpolation models that depend on large compute budgets and large-scale datasets. The paper compares it with baselines, including direct inference and training-based video interpolation methods, and conducts ablation studies that demonstrate the importance of the proposed components for generating high-quality, high frame rate videos. In summary, this paper addresses the challenge of raising the frame rate of pre-trained video generation models without retraining or additional data.
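The two-branch idea described in this summary can be illustrated with a small, dependency-free Python sketch. It is not the paper's correction module: the function names, the linear blend, the `strength` parameter, and the choice to correct only the key-frame positions are all assumptions made for illustration. It only shows how hidden states from a short (key frame) branch could calibrate the matching positions of a long (interpolated) branch.

```python
from typing import List

Vector = List[float]


def long_sequence_length(num_key_frames: int, interval: int) -> int:
    """Frames in the long sequence when key frames sit `interval` apart."""
    return (num_key_frames - 1) * interval + 1


def key_frame_indices(num_key_frames: int, interval: int) -> List[int]:
    """Positions of the key frames inside the long sequence."""
    return [i * interval for i in range(num_key_frames)]


def correct_hidden_states(
    long_hidden: List[Vector],
    short_hidden: List[Vector],
    interval: int,
    strength: float,
) -> List[Vector]:
    """Blend short-branch hidden states into the long branch at key frames.

    Hypothetical stand-in for a hidden state correction step:
    strength = 0 leaves the long branch untouched, strength = 1 copies
    the short-branch state onto each key-frame position.
    """
    if not short_hidden:
        raise ValueError("short_hidden must contain at least one key frame")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    indices = key_frame_indices(len(short_hidden), interval)
    if indices[-1] >= len(long_hidden):
        raise ValueError("long_hidden is too short for this interval")

    # Copy so the caller's long-branch states are not modified in place.
    corrected = [list(vec) for vec in long_hidden]
    for k, t in enumerate(indices):
        corrected[t] = [
            (1.0 - strength) * l + strength * s
            for l, s in zip(long_hidden[t], short_hidden[k])
        ]
    return corrected


# Toy example: 3 key frames, one interpolated frame between each pair.
short = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
long_states = [[0.0, 0.0] for _ in range(long_sequence_length(3, 2))]
corrected = correct_hidden_states(long_states, short, interval=2, strength=0.5)
```

In this toy setup only positions 0, 2, and 4 of the long sequence are pulled toward the short branch; the interpolated positions 1 and 3 are left as the long branch produced them. A real model would apply such a correction to hidden states inside the network at each denoising step rather than to plain lists.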

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation
