Towards Online Real-Time Memory-based Video Inpainting Transformers

Guillaume Thiry, Hao Tang, Radu Timofte, Luc Van Gool
2024-03-24
Abstract: Video inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they are still unsuitable for live videos, one of the last steps to make them completely convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and show an insufficient frame rate. In our approach, we propose a framework to adapt existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining a decent inpainting quality. Using this framework with some of the most recent inpainting models, we show great online results with a consistent throughput above 20 frames per second. The code and pretrained models will be made available upon acceptance.
Computer Science
What problem does this paper attempt to address?
This paper addresses online, real-time video inpainting. Current state-of-the-art video inpainting models perform well in reconstruction quality and temporal consistency, but they are not suitable for live-stream video: they process the entire video offline, and their frame rate is too low. This limits their use in practice, especially given the growing amount of live-stream content (cultural and sports events, social media live-streams, and so on). The paper therefore proposes a framework that adapts existing inpainting transformer models to run online and in real time while maintaining good inpainting quality.

### Main contributions of the paper:

1. **Online processing**: Explore how to make any inpainting model work online, that is, inpaint the current frame using only past information instead of relying on future frames. This sacrifices some inpainting quality but provides a baseline for online processing.
2. **Memory mechanism**: Introduce a memory mechanism that saves and reuses the computation results of previous frames, reducing the amount of computation needed for subsequent frames. This increases the frame rate threefold, which meets the standard for real-time processing, at the cost of a further loss in quality.
3. **Refined memory mechanism**: Further improve the memory mechanism by having two models work together. One model inpaints the current frame in real time; the other re-inpaints past frames and passes its results to the first model. This improves the overall inpainting quality while keeping real-time processing ability.

### Experimental results:

- **Quantitative evaluation**: Four metrics, namely PSNR, SSIM, VFID, and Ewarp, are used to evaluate the models. The results show that the online models lose some quality but gain significantly in frame rate; once combined with the memory mechanism, the frame rate meets the requirements for real-time processing.
- **Comparison of different models**: By varying the size of the input window, the paper draws a quality/speed curve (see Figure 4) showing the performance of different models under different input sizes. This helps users choose an appropriate model for their needs.

### Conclusion:

The paper proposes a framework that enables existing inpainting transformer models to run online and in real time while maintaining relatively high inpainting quality. This matters for processing live-stream content, especially in scenarios such as cultural and sports events and social media live-streams.
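
As a rough illustration of the online and memory-based ideas listed under the main contributions, the toy sketch below compares an online baseline that re-encodes its whole window of past frames at every step with a variant that caches per-frame results and only encodes the newest frame. It is not the paper's implementation: `OnlineMemoryInpainter`, `naive_online`, `encode`, `decode`, and `window` are hypothetical names, and the arithmetic stand-ins replace any real transformer. Because the stand-in encoder treats each frame independently, caching here is lossless, so the sketch does not model the quality loss reported for the real models.

```python
from collections import deque
from typing import Callable, Deque, Sequence


class OnlineMemoryInpainter:
    """Toy online inpainter that encodes each frame once and reuses the result."""

    def __init__(self, encode: Callable, decode: Callable, window: int) -> None:
        self.encode = encode  # hypothetical per-frame feature extractor
        self.decode = decode  # hypothetical: features (oldest first) -> output
        self.memory: Deque = deque(maxlen=window)  # cached features of recent frames
        self.encode_calls = 0

    def step(self, frame):
        """Process one incoming frame using only past and current frames."""
        self.memory.append(self.encode(frame))  # only the new frame is encoded
        self.encode_calls += 1
        return self.decode(list(self.memory))


def naive_online(frames: Sequence, encode: Callable, decode: Callable, window: int):
    """Baseline without memory: re-encode the whole past window at every step."""
    outputs, calls = [], 0
    for t in range(len(frames)):
        past = frames[max(0, t - window + 1): t + 1]  # no future frames are used
        feats = [encode(f) for f in past]
        calls += len(feats)
        outputs.append(decode(feats))
    return outputs, calls


if __name__ == "__main__":
    encode = lambda x: 2 * x                        # stand-in for an expensive encoder
    decode = lambda feats: sum(feats) / len(feats)  # stand-in for attention + decoding
    frames = [1, 2, 3, 4]
    model = OnlineMemoryInpainter(encode, decode, window=2)
    cached = [model.step(f) for f in frames]
    naive, naive_calls = naive_online(frames, encode, decode, window=2)
    # Same outputs, fewer encoder calls with the cache.
    print(cached == naive, model.encode_calls, naive_calls)
```

In this toy setting the cached variant makes one encoder call per frame, while the baseline makes up to `window` calls per frame. That gap is the kind of redundant computation a memory mechanism aims to remove.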