Abstract: With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, the motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods always struggle to accurately predict the motion information between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), which is specifically designed for the VFI task. By incorporating motion priors between the conditional neighboring frames and the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in visually smooth and realistic results. Extensive experiments conducted on benchmark datasets demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing approaches, especially under challenging scenarios involving dynamic textures with complex motion.

What problem does this paper attempt to address?

The paper addresses a central challenge in Video Frame Interpolation (VFI): generating clear, coherent, and visually natural intermediate frames in complex dynamic scenes. The authors propose a new diffusion framework, Motion-Aware Latent Diffusion Models (MADiff), which aims to improve interpolation quality by explicitly using motion information between adjacent frames. The key contributions and technical details of MADiff are as follows:

1. **A new Vector Quantized Motion-Aware Generative Adversarial Network (VQ-MAGAN)**: This network incorporates inter-frame motion cues between the target interpolated frame and the given adjacent conditional frames into the prediction process. The motion cues are event volumes extracted by a pre-trained EventGAN.
2. **A new Motion-Aware Sampling Process (MA-Sampling)**: During training, motion cues can be extracted using the ground-truth target frame, but that frame is not available during sampling. MA-Sampling removes this discrepancy so that motion cues remain usable at sampling time and the predicted frame is refined gradually. At each time step, the coarse interpolated frame predicted at the previous time step is used to extract inter-frame motion cues, which are then fed into VQ-MAGAN and the denoising U-Net for the current time step's prediction.
3. **Experimental results**: In extensive experiments on multiple VFI benchmark datasets, MADiff significantly outperforms existing methods, particularly on challenging scenes with complex dynamic textures.

In summary, the paper targets the difficulty existing VFI methods have in accurately predicting motion in complex dynamic scenes, and introduces a new diffusion framework to generate smoother and more realistic interpolated frames.
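The control flow of the motion-aware sampling process described above can be illustrated with a toy sketch. This is not the authors' implementation: `extract_motion_cues`, `denoise_step`, and `decode` are hypothetical stand-ins for the pre-trained EventGAN, the denoising U-Net, and the VQ-MAGAN decoder, and frames are reduced to flat lists of pixel intensities. The initialisation of the first coarse frame from decoded noise is also an assumption, since the summary does not say how the first step is handled. Only the loop structure (re-extract cues from the previous coarse prediction, then denoise) reflects the description.

```python
import random

def extract_motion_cues(src, dst):
    """Hypothetical stand-in for the pre-trained EventGAN: a per-pixel difference."""
    return [d - s for s, d in zip(src, dst)]

def denoise_step(z, t, coarse, cue_in, cue_out):
    """Hypothetical stand-in for the denoising U-Net.

    Moves the latent toward the frame at which the motion from the previous
    frame (cue_in) and the motion to the next frame (cue_out) balance.
    """
    target = [c + 0.5 * (o - i) for c, i, o in zip(coarse, cue_in, cue_out)]
    return [zi + (ti - zi) / (t + 1) for zi, ti in zip(z, target)]

def decode(z):
    """Hypothetical stand-in for the VQ-MAGAN decoder: clamp to [0, 1].

    The real decoder is described as also receiving motion cues; this toy ignores them.
    """
    return [min(1.0, max(0.0, v)) for v in z]

def motion_aware_sampling(prev_frame, next_frame, num_steps=10, seed=0):
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in prev_frame]  # start from Gaussian noise
    coarse = decode(z)  # assumption: first coarse frame is the decoded noise
    for t in reversed(range(num_steps)):
        # Motion cues come from the coarse frame predicted at the previous step.
        cue_in = extract_motion_cues(prev_frame, coarse)
        cue_out = extract_motion_cues(coarse, next_frame)
        z = denoise_step(z, t, coarse, cue_in, cue_out)
        coarse = decode(z)  # refined coarse prediction for the next step
    return coarse

if __name__ == "__main__":
    print(motion_aware_sampling([0.0] * 4, [1.0] * 4))
```

The point of the sketch is that no ground-truth target frame is needed inside the loop: each iteration derives its motion cues from its own previous output, which is how the summary describes the gap between training and sampling being closed.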

Motion-aware Latent Diffusion Models for Video Frame Interpolation

LDMVFI: Video Frame Interpolation with Latent Diffusion Models

Motion-Aware Video Frame Interpolation

Frame Interpolation with Consecutive Brownian Bridge Diffusion

IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation

Video Interpolation with Diffusion Models

Video Frame Interpolation with Densely Queried Bilateral Correlation

Disentangled Motion Modeling for Video Frame Interpolation

Generalizable Implicit Motion Modeling for Video Frame Interpolation

A Motion Distillation Framework for Video Frame Interpolation

MV-Diffusion: Motion-aware Video Diffusion Model

Boost Video Frame Interpolation via Motion Adaptation

Video Frame Interpolation without Temporal Priors

Perception-Oriented Video Frame Interpolation via Asymmetric Blending

Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation

Progressive Spatial-temporal Collaborative Network for Video Frame Interpolation

Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation

Video Frame Interpolation via Structure-Motion based Iterative Fusion

Dynamic Video Frame Interpolation with integrated Difficulty Pre-Assessment

BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions

Dynamic Frame Interpolation in Wavelet Domain