Abstract: Video inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they are still unsuitable for live videos, one of the last steps to make them completely convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and show an insufficient frame rate. In our approach, we propose a framework to adapt existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining a decent inpainting quality. Using this framework with some of the most recent inpainting models, we show great online results with a consistent throughput above 20 frames per second. The code and pretrained models will be made available upon acceptance.
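To make the idea of online processing with memorized computations concrete, here is a minimal toy sketch. It is not the authors' implementation: the class `OnlineMemoryInpainter`, the functions `toy_encode` and `toy_fuse`, and the representation of frames as short lists with `None` for missing pixels are all assumptions made for illustration. The sketch shows only two properties described in the abstract: each output frame depends on the current and past frames alone, and the per-frame encoding is computed once and then reused from memory.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Frame = List[Optional[float]]  # toy frame: None marks a missing (masked) pixel


class OnlineMemoryInpainter:
    """Hypothetical wrapper: inpaints frame t using frames <= t only."""

    def __init__(self,
                 encode: Callable[[Frame], Frame],
                 fuse: Callable[[List[Frame]], List[float]],
                 window: int = 3) -> None:
        self.encode = encode
        self.fuse = fuse
        # Memory of per-frame encodings; the oldest entry is dropped
        # automatically once the window is full.
        self.memory: Deque[Frame] = deque(maxlen=window)
        self.encode_calls = 0

    def step(self, frame: Frame) -> List[float]:
        # Encode only the new frame; past frames are served from memory.
        self.memory.append(self.encode(frame))
        self.encode_calls += 1
        return self.fuse(list(self.memory))


def toy_encode(frame: Frame) -> Frame:
    # Stand-in for an expensive encoder (assumption: identity copy).
    return list(frame)


def toy_fuse(memory: List[Frame]) -> List[float]:
    # Fill each missing pixel of the latest frame with the most recent
    # known value from past frames; never looks at future frames.
    latest = memory[-1]
    out: List[float] = []
    for i, value in enumerate(latest):
        if value is None:
            past = [f[i] for f in reversed(memory[:-1]) if f[i] is not None]
            value = past[0] if past else 0.0
        out.append(value)
    return out


inpainter = OnlineMemoryInpainter(toy_encode, toy_fuse, window=3)
stream: List[Frame] = [[1.0, 2.0], [None, 3.0], [None, None], [4.0, None]]
outputs = [inpainter.step(frame) for frame in stream]
print(outputs)                 # one inpainted frame per incoming frame
print(inpainter.encode_calls)  # one encoder call per frame, not per window
```

In this toy setting the encoder runs once per incoming frame regardless of the window size, which is the kind of redundancy a memory is meant to remove. A real inpainting transformer would cache intermediate tokens rather than raw pixel lists.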

What problem does this paper attempt to address?

This paper addresses online, real-time video inpainting. Current state-of-the-art video inpainting models achieve good reconstruction quality and temporal consistency, but they are unsuitable for live-stream video for two reasons: they process the entire video offline, and their frame rate is too low. This limits their practical use, particularly given the growing volume of live-stream content (cultural and sports events, social media live streams, and so on). The paper therefore proposes a framework that adapts existing inpainting transformer models to run online and in real time while maintaining good inpainting quality.

### Main contributions of the paper:

1. **Online processing**: Explore how to make any inpainting model work online, that is, inpaint the current frame using only past information instead of relying on future frames. This sacrifices some inpainting quality, but it provides a baseline for online processing.
2. **Memory mechanism**: Introduce a memory that stores the results computed for previous frames and reuses them, reducing the computation needed for subsequent frames. This increases the frame rate threefold, which meets the standard for real-time processing, at a further cost in quality.
3. **Refined memory mechanism**: Improve the memory mechanism by having two models work together. One model inpaints the current frame in real time, while the other re-inpaints past frames and passes its results to the first model. This improves overall inpainting quality while preserving real-time processing.

### Experimental results:

- **Quantitative evaluation**: Four metrics, PSNR, SSIM, VFID, and Ewarp, are used to evaluate model performance. The results show that the online models lose some quality but gain significantly in frame rate; once the memory mechanism is added, the frame rate meets the requirement for real-time processing.
- **Comparison of different models**: By varying the size of the input window, the paper plots a quality/speed curve (see Figure 4) showing how different models perform at different input sizes. This helps users choose a model that fits their needs.

### Conclusion:

The paper proposes a framework that enables existing inpainting transformer models to run online and in real time while maintaining relatively high inpainting quality. This matters for live-stream content, especially cultural and sports events and social media live streams.
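Of the four metrics named in the quantitative evaluation, PSNR is the simplest to state: it is 10 · log10(MAX² / MSE), where MSE is the mean squared error between the ground-truth and inpainted frames and MAX is the largest possible pixel value. The sketch below is a plain-Python illustration of that standard definition, not the paper's evaluation code; the function name `psnr`, the flat-list frame representation, and the sample values are assumptions.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         restored: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel frames: one pixel is off by 10 intensity levels.
ground_truth = [100, 100, 100, 100]
inpainted = [110, 100, 100, 100]
score = psnr(ground_truth, inpainted)
print(f"{score:.2f} dB")
```

Here the MSE is 25, so the score is 10 · log10(65025 / 25), roughly 34.15 dB. Higher values indicate a reconstruction closer to the ground truth. For video, a per-frame score like this is typically averaged over all frames.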

Towards Online Real-Time Memory-based Video Inpainting Transformers

RT-VENet: A Convolutional Network for Real-time Video Enhancement

Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization

DeViT: Deformed Vision Transformers in Video Inpainting

Learning Joint Spatial-Temporal Transformations for Video Inpainting

Transformer-based Image and Video Inpainting: Current Challenges and Future Directions

Frame-Recurrent Video Inpainting by Robust Optical Flow Inference

Recurrent Temporal Aggregation Framework for Deep Video Inpainting

DLFormer: Discrete Latent Transformer for Video Inpainting

RetinaViT: Efficient Visual Backbone for Online Video Streams

WTVI: A Wavelet-Based Transformer Network for Video Inpainting

FSTT: Flow-Guided Spatial Temporal Transformer for Deep Video Inpainting

Video Inpainting of Complex Scenes

Learnable Gated Temporal Shift Module for Deep Video Inpainting

A low-latency inpainting method for unstably transmitted videos

ProPainter: Improving Propagation and Transformer for Video Inpainting

A Temporally-Aware Interpolation Network for Video Frame Inpainting

Progressive Temporal Feature Alignment Network for Video Inpainting

Dynamic Graph Memory Bank for Video Inpainting

Decoupled Spatial-Temporal Transformer for Video Inpainting

Feature Pre-Inpainting Enhanced Transformer for Video Inpainting