AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation

Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, Ming-Ming Cheng
2023-04-20
Abstract: We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation. It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations for updating both the flows and the interpolated content feature. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows and use each group to backward-warp the input frames separately. Combining these two designs enables us to generate promising task-oriented flows and reduces the difficulty of modeling large motions and handling occluded areas during frame interpolation. These qualities enable our model to achieve state-of-the-art performance on various benchmarks with high efficiency. Moreover, our convolution-based model compares favorably with Transformer-based models in terms of accuracy and efficiency. Our code is available at https://github.com/MCG-NKU/AMT.
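As a rough illustration of the first design, the following NumPy sketch builds an all-pairs correlation volume between two feature maps and obtains the reverse-direction volume by transposing it. The function names, the `(C, H, W)` feature layout, and the `1/sqrt(C)` scaling are assumptions made for this sketch (the scaling is a common convention in correlation-based flow estimators); the abstract does not specify them, and the released code may differ.

```python
import numpy as np


def all_pairs_correlation(feat0, feat1):
    """Dot-product similarity between every pixel of feat0 and every pixel of feat1.

    feat0, feat1: arrays of shape (C, H, W) (assumed layout).
    Returns an array of shape (H, W, H, W) where entry [i, j, k, l] is the
    scaled dot product of feat0[:, i, j] and feat1[:, k, l].
    """
    c, h, w = feat0.shape
    f0 = feat0.reshape(c, h * w)
    f1 = feat1.reshape(c, h * w)
    # Scaling by sqrt(C) is an assumption, used to keep magnitudes comparable
    # across feature widths.
    corr = f0.T @ f1 / np.sqrt(c)
    return corr.reshape(h, w, h, w)


def bidirectional_correlation(feat0, feat1):
    """Return the 0->1 volume and the 1->0 volume.

    The reverse volume is a transpose of the forward one, so the expensive
    matrix product is computed only once.
    """
    corr01 = all_pairs_correlation(feat0, feat1)
    corr10 = corr01.transpose(2, 3, 0, 1)
    return corr01, corr10
```

Because both directions share one matrix product, the memory cost is dominated by a single `(H*W) x (H*W)` array, which is one reason such volumes are usually built on downsampled features.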
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two main challenges in Video Frame Interpolation (VFI): handling large motions and dealing with occluded areas. The authors propose a network architecture named "All-Pairs Multi-Field Transforms" (AMT) to improve both the accuracy and the efficiency of video frame interpolation. AMT has two main contributions:
1. It builds bidirectional correlation volumes for all pixel pairs and uses the predicted bilateral flows to retrieve correlations from them. The retrieved correlations update both the flows and the interpolated content feature, which improves the fidelity of flow estimation.
2. It derives multiple groups of fine-grained flow fields from a single pair of updated coarse flows. Each group is used to backward-warp the input frames to the target time step, which improves the handling of occluded areas.
With these designs, AMT generates high-quality task-oriented flows and reduces the difficulty of modeling large motions and occlusions, achieving state-of-the-art performance on various benchmarks while remaining efficient.
Experimental results indicate that the small model (AMT-S) surpasses IFRNet-B by 0.17 dB PSNR on the Vimeo90K dataset with only 60% of its FLOPs and parameters. In the large-scale setting, AMT-L exceeds IFRNet-L by 0.15 dB PSNR on Vimeo90K with 75% of its FLOPs and 65% of its parameters.
Although AMT is convolution-based, it also compares favorably with Transformer-based VFI models such as VFIformer and EMA-VFI in accuracy, parameter count, floating-point operations (FLOPs), and inference speed.
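The second design, warping the inputs with several flow fields and combining the results, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single-channel frames, the per-field blending masks, and the plain average used to fuse the candidates are all hypothetical stand-ins (a trained model would typically predict the masks and learn the fusion).

```python
import numpy as np


def backward_warp(frame, flow):
    """Sample `frame` at positions displaced by `flow` using bilinear interpolation.

    frame: (H, W) array. flow: (2, H, W) array with flow[0] = dx, flow[1] = dy.
    output[y, x] is frame sampled at (y + dy, x + dx); coordinates are clamped
    to the image border.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[0], 0, w - 1)
    y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0
    wy = y - y0
    top = (1 - wx) * frame[y0, x0] + wx * frame[y0, x1]
    bottom = (1 - wx) * frame[y1, x0] + wx * frame[y1, x1]
    return (1 - wy) * top + wy * bottom


def multi_field_interpolate(frame0, frame1, flows_t0, flows_t1, masks):
    """Build one candidate per flow group and average the candidates.

    flows_t0, flows_t1: (K, 2, H, W) flows from the target time to each input.
    masks: (K, H, W) blending weights in [0, 1] (hypothetical; 1 favors frame0).
    The mean over K candidates is a placeholder for a learned fusion step.
    """
    candidates = []
    for flow0, flow1, mask in zip(flows_t0, flows_t1, masks):
        warped0 = backward_warp(frame0, flow0)
        warped1 = backward_warp(frame1, flow1)
        candidates.append(mask * warped0 + (1 - mask) * warped1)
    return np.mean(candidates, axis=0)
```

Having several flow fields per pixel lets different candidates pull content from different source locations, which is useful near occlusion boundaries where no single displacement is valid for both input frames.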