Abstract: Video inpainting aims to fill given spatiotemporal holes with realistic appearance, but it remains a challenging task even with the progress of deep learning approaches. Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance. However, these methods still suffer from synthesizing blurry textures and from a huge computational cost. To this end, we propose a novel Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency. Our proposed DSTT disentangles the task of learning spatial-temporal attention into two sub-tasks: one attends to temporal object movements across different frames at the same spatial locations, which is achieved by a temporally-decoupled Transformer block, and the other attends to similar background textures across all spatial positions of the same frame, which is achieved by a spatially-decoupled Transformer block. The interleaved stack of these two blocks lets our model attend to background textures and moving objects more precisely, so that the attended plausible and temporally coherent appearance can be propagated to fill the holes. In addition, a hierarchical encoder is adopted before the stack of Transformer blocks to learn robust, hierarchical features that maintain multi-level local spatial structure, resulting in more representative token vectors. The seamless combination of these two novel designs forms a better spatial-temporal attention scheme, and our proposed model achieves better performance than state-of-the-art video inpainting approaches with significantly boosted efficiency. Training code and pretrained models are available at https://github.com/ruiliu-ai/DSTT.
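
The decoupling described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released DSTT implementation. It assumes a token tensor of shape (T, N, C) for T frames, N spatial tokens per frame, and C channels, and it uses single-head attention with no learned projections, no hierarchical encoder, and no feed-forward layers. The function names `temporal_block`, `spatial_block`, and `interleaved_stack` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(tokens):
    # Scaled dot-product self-attention over the second-to-last axis.
    # tokens: (..., L, C). Assumption: Q = K = V = tokens (no learned weights).
    scale = tokens.shape[-1] ** -0.5
    scores = tokens @ np.swapaxes(tokens, -1, -2) * scale  # (..., L, L)
    return softmax(scores, axis=-1) @ tokens


def temporal_block(x):
    # x: (T, N, C). Regroup to (N, T, C) so that each spatial location
    # attends only to itself across the T frames.
    out = attention(np.swapaxes(x, 0, 1))
    return x + np.swapaxes(out, 0, 1)  # residual connection


def spatial_block(x):
    # x: (T, N, C). Each frame attends only across its own N spatial tokens.
    return x + attention(x)  # residual connection


def interleaved_stack(x, depth=2):
    # Alternate temporal and spatial blocks, as in the interleaved stack
    # the abstract describes (depth is an illustrative parameter).
    for _ in range(depth):
        x = temporal_block(x)
        x = spatial_block(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((5, 16, 32))  # T=5, N=16, C=32
    print(interleaved_stack(tokens).shape)
```

In this simplified setting, joint attention over all T·N tokens would form a score matrix with (T·N)² entries, while the two decoupled blocks form N·T² and T·N² entries respectively, which gives one intuition for the efficiency claim.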
