Motion-aware Latent Diffusion Models for Video Frame Interpolation

Zhilin Huang, Yijie Yu, Ling Yang, Chujun Qin, Bing Zheng, Xiawu Zheng, Zikun Zhou, Yaowei Wang, Wenming Yang
2024-08-03
Abstract: With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods often struggle to accurately predict the motion information between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, motion-aware latent diffusion models (MADiff), which is specifically designed for the VFI task. By incorporating motion priors between the conditional neighboring frames and the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in results that are both visually smooth and realistic. Extensive experiments conducted on benchmark datasets demonstrate that our method achieves state-of-the-art performance, significantly outperforming existing approaches, especially under challenging scenarios involving dynamic textures with complex motion.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses video frame interpolation (VFI), in particular how to generate clear, coherent, and visually natural intermediate frames in complex dynamic scenes. The authors propose a new diffusion framework, Motion-Aware Latent Diffusion Models (MADiff), which aims to improve interpolation quality by exploiting motion information between adjacent frames. The key contributions and technical details of MADiff are as follows:

1. **Vector Quantized Motion-Aware Generative Adversarial Network (VQ-MAGAN)**: This network incorporates inter-frame motion cues between the target interpolated frame and the given adjacent conditional frames into the prediction process. The motion cues are event volumes extracted by a pre-trained EventGAN.
2. **Motion-Aware Sampling Process (MA-Sampling)**: This sampling process removes the discrepancy in how motion cues are extracted during training and during sampling, where the target frame is not available. It makes the motion cues usable at sampling time and gradually refines the predicted interpolated frame. At each step, the coarse interpolated frame predicted at the previous time step is used to extract inter-frame motion cues, which are then fed into VQ-MAGAN and the denoising U-Net for the current time step's prediction.
3. **Experimental results**: In extensive experiments on multiple VFI benchmark datasets, MADiff significantly outperforms existing methods, particularly on challenging scenes with complex dynamic textures.

In summary, the paper targets a weakness of existing VFI methods, namely the difficulty of accurately predicting motion information in complex dynamic scenes, and introduces a new diffusion framework that generates smoother and more realistic interpolated frames.
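The control flow of the motion-aware sampling process described above can be sketched as a small loop. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `toy_motion_cue`, `toy_denoise`, and `toy_decode` are hypothetical stand-ins for the pre-trained EventGAN, the denoising U-Net, and the VQ-MAGAN decoder, and both the update rule and the choice of the initial coarse estimate are invented for illustration.

```python
import numpy as np


def toy_motion_cue(frame_a, frame_b):
    # Hypothetical stand-in for EventGAN event volumes: a plain signed difference.
    return frame_b - frame_a


def toy_denoise(latent, cue_prev, cue_next, prev_frame, t):
    # Hypothetical stand-in for the denoising U-Net conditioned on motion cues.
    # Toy rule: the target is the previous frame advanced by half the total motion.
    target = prev_frame + 0.5 * (cue_prev + cue_next)
    # Step size 1/t, so the latent lands on the target at the final step (t == 1).
    return latent + (target - latent) / t


def toy_decode(latent, cue_prev, cue_next):
    # Hypothetical stand-in for the VQ-MAGAN decoder; this toy ignores the cues
    # and returns the latent unchanged.
    return latent


def ma_sampling_sketch(prev_frame, next_frame, num_steps=10, seed=0):
    """Toy sampling loop: cues are re-extracted from the latest coarse frame."""
    rng = np.random.default_rng(seed)
    latent = rng.standard_normal(prev_frame.shape)  # start from Gaussian noise
    coarse = latent.copy()  # assumption: the first cues come from the noisy estimate
    history = []
    for t in range(num_steps, 0, -1):
        # Motion cues between each neighbor and the previous step's coarse frame.
        cue_prev = toy_motion_cue(prev_frame, coarse)
        cue_next = toy_motion_cue(coarse, next_frame)
        # The cues condition both the denoising step and the decoding step.
        latent = toy_denoise(latent, cue_prev, cue_next, prev_frame, t)
        coarse = toy_decode(latent, cue_prev, cue_next)
        history.append(coarse)
    return coarse, history


if __name__ == "__main__":
    prev_frame = np.zeros((4, 4))
    next_frame = np.ones((4, 4))
    frame, steps = ma_sampling_sketch(prev_frame, next_frame, num_steps=5)
    print(frame.shape, len(steps))
```

Because the stand-ins are linear, this toy loop simply converges to the midpoint of the two neighboring frames. Only the ordering of operations (extract cues from the previous coarse prediction, then denoise and decode with those cues) is meant to mirror the summary.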