WTVI: A Wavelet-Based Transformer Network for Video Inpainting

Ke Zhang, Guanxiao Li, Yu Su, Jingyu Wang
DOI: https://doi.org/10.1109/lsp.2024.3361805
2024-02-23
IEEE Signal Processing Letters
Abstract: Video inpainting aims to complete missing frames in a visually convincing way by balancing high-frequency detailed textures and low-frequency semantic structures. Conventional approaches optimize output frames with both adversarial and reconstruction losses, each favoring different frequency components, to achieve this balance. However, employing both loss types concurrently often results in a conflict between perceptual and distortion qualities, mainly due to their distinct frequency preferences. In response, this letter introduces the Wavelet-based Transformer network for Video Inpainting (WTVI). WTVI employs a 2D discrete wavelet transform (DWT) to decompose frames into separate frequency bands while preserving spatial information. It then independently completes the missing regions in each band using a Transformer network. To mitigate inter-frequency conflicts, we apply reconstruction loss to the low-frequency bands and adversarial loss to the high-frequency bands. Additionally, we introduce High-frequency Cross-Attention (HCA) and Low-frequency Cross-Attention (LCA) modules to enhance frequency dependency learning beyond the spatial-temporal scope and to align features across bands. Our experiments confirm that WTVI surpasses previous methods, significantly improving both quantitative and qualitative performance.
engineering, electrical & electronic
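
The band decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a single-level Haar wavelet (the abstract does not name the wavelet family or the number of levels), and the function names `haar_dwt2`, `haar_idwt2`, and `low_band_l1` are hypothetical, not taken from the paper. Each sub-band keeps a half-resolution spatial layout, so every coefficient still maps to a 2x2 block of the frame. The adversarial loss on the high-frequency bands needs a discriminator network and is omitted here.

```python
import numpy as np

def haar_dwt2(frame):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.

    Returns the low-frequency band and a tuple of three detail bands.
    """
    frame = np.asarray(frame, dtype=np.float64)
    a = frame[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = frame[0::2, 1::2]  # top-right
    c = frame[1::2, 0::2]  # bottom-left
    d = frame[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0       # local average (low frequency)
    det_rows = (a + b - c - d) / 2.0  # difference between the two rows
    det_cols = (a - b + c - d) / 2.0  # difference between the two columns
    det_diag = (a - b - c + d) / 2.0  # diagonal difference
    return low, (det_rows, det_cols, det_diag)

def haar_idwt2(low, details):
    """Inverse of haar_dwt2: rebuild the full-resolution frame."""
    det_rows, det_cols, det_diag = details
    h, w = low.shape
    out = np.empty((2 * h, 2 * w), dtype=np.float64)
    out[0::2, 0::2] = (low + det_rows + det_cols + det_diag) / 2.0
    out[0::2, 1::2] = (low + det_rows - det_cols - det_diag) / 2.0
    out[1::2, 0::2] = (low - det_rows + det_cols - det_diag) / 2.0
    out[1::2, 1::2] = (low - det_rows - det_cols + det_diag) / 2.0
    return out

def low_band_l1(pred_low, target_low):
    """L1 reconstruction loss restricted to the low-frequency band.

    Assumption: L1 stands in for the paper's reconstruction loss.
    """
    return float(np.mean(np.abs(pred_low - target_low)))

if __name__ == "__main__":
    frame = np.random.default_rng(0).random((8, 8))
    low, details = haar_dwt2(frame)
    restored = haar_idwt2(low, details)
    print(low.shape, np.allclose(restored, frame))
```

Because the transform is invertible, bands completed separately can be merged back into a full frame with `haar_idwt2`, which is what makes per-band losses possible without discarding information.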