Perception-Oriented Video Frame Interpolation via Asymmetric Blending

Guangyang Wu, Xin Tao, Changlin Li, Wenyi Wang, Xiaohong Liu, Qingqing Zheng
2024-04-10
Abstract: Previous methods for Video Frame Interpolation (VFI) have encountered challenges, notably the manifestation of blur and ghosting effects. These issues can be traced back to two pivotal factors: unavoidable motion errors and misalignment in supervision. In practice, motion estimates often prove to be error-prone, resulting in misaligned features. Furthermore, the reconstruction loss tends to bring blurry results, particularly in misaligned regions. To mitigate these challenges, we propose a new paradigm called PerVFI (Perception-oriented Video Frame Interpolation). Our approach incorporates an Asymmetric Synergistic Blending module (ASB) that utilizes features from both sides to synergistically blend intermediate features. One reference frame emphasizes primary content, while the other contributes complementary information. To impose a stringent constraint on the blending process, we introduce a self-learned sparse quasi-binary mask which effectively mitigates ghosting and blur artifacts in the output. Additionally, we employ a normalizing flow-based generator and utilize the negative log-likelihood loss to learn the conditional distribution of the output, which further facilitates the generation of clear and fine details. Experimental results validate the superiority of PerVFI, demonstrating significant improvements in perceptual quality compared to existing methods. Codes are available at \url{
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the blur and ghosting artifacts that are common in video frame interpolation (VFI). These artifacts stem mainly from two factors: unavoidable motion errors and temporal supervision misalignment.

1. **Unavoidable motion errors**: In principle, accurate motion estimation would yield satisfactory results. In practice, especially under large motion, error-free pixel-level correspondence is very difficult to achieve. The resulting feature misalignment degrades the quality of the generated intermediate frames.
2. **Temporal supervision misalignment**: During training, the ground-truth (GT) intermediate frame provides a reference at only one specific time point, yet in natural videos there may be multiple plausible solutions within the interval between two frames. The intermediate features learned from different training videos may therefore differ, which leads the network to generate blurry results.

To address these problems, the authors propose a perception-oriented video frame interpolation method, PerVFI. Its main innovations are:

- **Asymmetric Synergistic Blending module (ASB)**: Features from both sides are blended synergistically: one reference frame emphasizes the primary content, while the other contributes complementary information. To constrain the blending strictly, a self-learned sparse quasi-binary mask is introduced, which effectively reduces ghosting and blur artifacts in the output.
- **Normalizing flow-based generator**: A normalizing flow-based generator decodes the intermediate features. It models the conditional distribution of the output given the reference inputs, which further promotes the generation of clear details.

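
The ASB blending idea can be illustrated with a toy example. The sketch below is not the authors' implementation: the steep-sigmoid mask, the `temperature` parameter, the function names `quasi_binary_mask` and `asymmetric_blend`, and the convention that a mask value near 0 keeps the primary feature are all assumptions made for illustration. In the paper the mask is self-learned by the network rather than computed from hand-written logits.

```python
import math

def quasi_binary_mask(logits, temperature=0.1):
    """Push raw logits toward 0 or 1 with a steep sigmoid (assumed form)."""
    return [1.0 / (1.0 + math.exp(-x / temperature)) for x in logits]

def asymmetric_blend(primary, complementary, mask):
    """Keep the primary feature where mask ~ 0, take the complementary one where mask ~ 1."""
    return [(1.0 - m) * p + m * c for p, c, m in zip(primary, complementary, mask)]

# Toy 1-D "features" from the two reference frames (hypothetical values).
primary = [1.0, 2.0, 3.0, 4.0]
complementary = [10.0, 20.0, 30.0, 40.0]

# Sparse mask: only position 2 is handed over to the complementary frame.
logits = [-2.0, -2.0, 2.0, -2.0]
mask = quasi_binary_mask(logits)
blended = asymmetric_blend(primary, complementary, mask)

print([round(m, 3) for m in mask])
print([round(b, 3) for b in blended])
```

With a soft mask near 0.5, the same formula would average the two inputs, which is how misaligned features end up superimposed as ghosting. A mask that commits to one side at each position avoids mixing them.
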
Compared with GAN-based and diffusion-based methods, the normalizing flow-based approach is more stable during training and has lower latency at inference. The experimental results confirm the significant advantage of PerVFI in perceptual quality: the generated intermediate frames have higher visual quality, particularly under large motion and temporal supervision misalignment.
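
The negative log-likelihood objective of a normalizing flow follows from the change-of-variables formula, log p(x) = log p(z) + log |dz/dx|. The sketch below uses a single conditional affine layer in one dimension, a deliberately tiny stand-in for the paper's generator. The names `affine_flow_nll` and `sample`, and the parameters `shift` and `log_scale` (which a conditioning network would predict from the reference frames), are hypothetical.

```python
import math

def affine_flow_nll(x, shift, log_scale):
    """NLL of x under the flow z = (x - shift) * exp(-log_scale), with z ~ N(0, 1)."""
    z = (x - shift) * math.exp(-log_scale)
    log_prior = -0.5 * (z * z + math.log(2.0 * math.pi))  # log N(z; 0, 1)
    log_det = -log_scale                                  # log |dz/dx|
    return -(log_prior + log_det)

def sample(shift, log_scale, z):
    """Invert the flow: map a latent z back to an output x."""
    return z * math.exp(log_scale) + shift

# Hypothetical conditioning output: the flow is centred at 0.5 with unit scale.
nll_at_mode = affine_flow_nll(0.5, shift=0.5, log_scale=0.0)
nll_off_mode = affine_flow_nll(2.5, shift=0.5, log_scale=0.0)

print(round(nll_at_mode, 4), round(nll_off_mode, 4))
```

Because the flow is invertible, the same parameters used to score a sample can also generate one: drawing `z` from the standard normal and calling `sample` yields an output from the learned conditional distribution.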