FSTT: Flow-Guided Spatial Temporal Transformer for Deep Video Inpainting

Ruixin Liu,Yuesheng Zhu
DOI: https://doi.org/10.3390/electronics12214452
IF: 2.9
2023-01-01
Electronics
Abstract:Video inpainting aims to complete the missing regions with content that is consistent both spatially and temporally. How to effectively utilize the spatio-temporal information in videos is critical for video inpainting. Recent advances in video inpainting methods combine both optical flow and transformers to capture spatio-temporal information. However, these methods fail to fully explore the potential of optical flow within the transformer. Furthermore, the designed transformer block cannot effectively integrate spatio-temporal information across frames. To address the above problems, we propose a novel video inpainting model, named Flow-Guided Spatial Temporal Transformer (FSTT), which effectively establishes correspondences between missing regions and valid regions in both spatial and temporal dimensions under the guidance of completed optical flow. Specifically, a Flow-Guided Fusion Feed-Forward module is developed to enhance features with the assistance of optical flow, mitigating the inaccuracies caused by hole pixels when performing MHSA. Additionally, a decomposed spatio-temporal MHSA module is proposed to effectively capture spatio-temporal dependencies in videos. To improve the efficiency of the model, a Global–Local Temporal MHSA module is further designed based on the window partition strategy. Extensive quantitative and qualitative experiments on the DAVIS and YouTube-VOS datasets demonstrate the superiority of our proposed method.
What problem does this paper attempt to address?