Abstract: We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation. It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations for updating both flows and the interpolated content feature. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately. Combining these two designs enables us to generate promising task-oriented flows and reduce the difficulties in modeling large motions and handling occluded areas during frame interpolation. These qualities promote our model to achieve state-of-the-art performance on various benchmarks with high efficiency. Moreover, our convolution-based model competes favorably compared to Transformer-based models in terms of accuracy and efficiency. Our code is available at https://github.com/MCG-NKU/AMT.
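The first design in the abstract, bidirectional correlation volumes over all pixel pairs with flow-guided retrieval, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `(C, H, W)` feature layout, the scaling by `sqrt(C)`, and the nearest-neighbor lookup of a single correlation value per pixel are all assumptions made for illustration.

```python
import numpy as np


def all_pairs_correlation(feat_a, feat_b):
    """Dot-product similarity between every pixel of feat_a and every pixel of feat_b.

    feat_a, feat_b: arrays of shape (C, H, W).
    Returns an array of shape (H, W, H, W), where entry [i, j, k, l] compares
    pixel (i, j) of feat_a with pixel (k, l) of feat_b.
    The division by sqrt(C) is an assumed normalization.
    """
    c, h, w = feat_a.shape
    a = feat_a.reshape(c, h * w)
    b = feat_b.reshape(c, h * w)
    corr = a.T @ b / np.sqrt(c)
    return corr.reshape(h, w, h, w)


def bidirectional_correlation(feat0, feat1):
    """Return the volumes for both directions (frame 0 -> 1 and frame 1 -> 0).

    The reverse volume holds the same values with the two pixel
    index pairs swapped, so it is obtained by a transpose.
    """
    corr_01 = all_pairs_correlation(feat0, feat1)
    corr_10 = corr_01.transpose(2, 3, 0, 1)
    return corr_01, corr_10


def lookup(corr, flow):
    """Retrieve, for each source pixel, the correlation at its flow target.

    corr: (H, W, H, W) volume; flow: (2, H, W) with flow[0] = dx, flow[1] = dy.
    Hypothetical simplification: one nearest-neighbor sample per pixel,
    clamped to the image border.
    """
    h, w = corr.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ty = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    tx = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    return corr[ys, xs, ty, tx]
```

Because the volume covers all pixel pairs, a match can be retrieved however far the flow points, which is the intuition behind using it for large motions. Its memory cost grows with the square of the pixel count, so sketches like this are practical only at reduced feature resolution.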

What problem does this paper attempt to address?

The paper addresses two main challenges in Video Frame Interpolation (VFI): modeling large motions and handling occluded areas. The authors propose a network architecture named "All-Pairs Multi-Field Transforms" (AMT) to improve both the accuracy and the efficiency of frame interpolation. AMT rests on two designs:

1. It builds bidirectional correlation volumes for all pixel pairs and uses the predicted bilateral flows to retrieve correlations, which then update both flows and the interpolated content feature. This improves the fidelity of flow estimation.

2. It derives multiple groups of fine-grained flow fields from one pair of updated coarse flows. Each group is used to backward warp the input frames separately to the target time step, which improves the handling of occluded areas.

Together, these designs let AMT generate task-oriented flows that ease the modeling of large motions and occlusions, giving state-of-the-art performance on various benchmarks at high efficiency. Experimental results indicate that the small model (AMT-S) surpasses IFRNet-B by 0.17 dB PSNR on the Vimeo90K dataset with only 60% of its FLOPs and parameter count. In the large-scale setting, AMT-L exceeds IFRNet-L by 0.15 dB PSNR on Vimeo90K with only 75% of its FLOPs and 65% of its parameter count. Compared with Transformer-based VFI models such as VFIFormer and EMA-VFI, the convolution-based AMT competes favorably in accuracy while using fewer parameters and FLOPs and running faster at inference.
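The second design, backward warping the input frames with several flow fields, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's method: `backward_warp` and `multi_field_interpolate` are hypothetical names, the per-group blending masks are assumed inputs, and the plain average over candidates is a stand-in for whatever learned fusion the model actually uses.

```python
import numpy as np


def backward_warp(img, flow):
    """Sample img at (x + dx, y + dy) for every output pixel (x, y).

    img: (C, H, W); flow: (2, H, W) with flow[0] = dx, flow[1] = dy.
    Uses bilinear interpolation; sample positions are clamped to the border.
    """
    c, h, w = img.shape
    ys, xs = np.meshgrid(
        np.arange(h, dtype=np.float64),
        np.arange(w, dtype=np.float64),
        indexing="ij",
    )
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0
    wy = sy - y0
    top = img[:, y0, x0] * (1 - wx) + img[:, y0, x1] * wx
    bottom = img[:, y1, x0] * (1 - wx) + img[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def multi_field_interpolate(img0, img1, flows_t0, flows_t1, masks):
    """Build one candidate frame per flow group and average the candidates.

    flows_t0[n], flows_t1[n]: (2, H, W) flows from the target time to each input.
    masks[n]: (H, W) blending weights in [0, 1] (assumed input, 1 favors img0).
    Averaging is an assumed placeholder for a learned combination step.
    """
    candidates = []
    for f0, f1, m in zip(flows_t0, flows_t1, masks):
        warped0 = backward_warp(img0, f0)
        warped1 = backward_warp(img1, f1)
        candidates.append(m * warped0 + (1 - m) * warped1)
    return np.mean(candidates, axis=0)
```

With several flow groups, a pixel that is occluded or poorly matched under one flow field can still receive a usable sample from another, which is one way to read the summary's claim about occlusion handling.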

AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation
