Abstract: Recently, deep learning-based image inpainting methods have made great strides in reconstructing damaged regions. However, these methods often struggle to produce satisfactory results on images with large holes, leading to distorted structures and blurred textures. To address these problems, we combine the advantages of transformers and convolutions and propose an image inpainting method that incorporates edge priors and attention mechanisms. The proposed method aims to improve the inpainting of large holes by enhancing the accuracy of structure restoration and the recovery of texture details. It divides the inpainting task into two phases: edge prediction and image inpainting. Specifically, in the edge prediction phase, a transformer architecture is designed that combines axial attention with standard self-attention. This design strengthens the extraction of global structural features and location awareness while keeping the complexity of the self-attention operations in check, resulting in accurate prediction of the edge structure in the defective region. In the image inpainting phase, a multi-scale fusion attention module is introduced. This module makes full use of multi-level long-range features and enhances local pixel continuity, thereby significantly improving the quality of image inpainting. To evaluate the performance of our method, comparative experiments are conducted on several datasets, including CelebA, Places2, and Facade. Quantitative experiments show that our method outperforms the mainstream methods it is compared against. Specifically, it improves Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) by 1.141 to 3.234 dB and 0.083 to 0.235, respectively. Moreover, it reduces Learned Perceptual Image Patch Similarity (LPIPS) and Mean Absolute Error (MAE) by 0.0347 to 0.1753 and 0.0104 to 0.0402, respectively.
Qualitative experiments show that our method reconstructs images with complete structural information and clear texture details. Furthermore, our model performs well in terms of the number of parameters, memory cost, and testing time.
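For readers unfamiliar with two of the reported metrics, the following is a minimal sketch of the standard definitions of MAE and PSNR. The abstract does not say how the authors computed them, so the pixel range (`max_value=1.0`), the flattened-list image representation, and the toy data are assumptions made purely for illustration. SSIM and LPIPS are omitted because they need windowed statistics and a pretrained network, respectively.

```python
import math


def mae(reference, restored):
    """Mean absolute error between two equally sized, flattened images."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - s) for r, s in zip(reference, restored)) / len(reference)


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images).

    max_value is the largest possible pixel value; 1.0 is an assumption
    here, and 255 would be used for 8-bit images.
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 grayscale image, flattened, with pixel values in [0, 1].
reference = [0.0, 0.5, 1.0, 0.5]
restored = [0.1, 0.5, 0.9, 0.5]

print("MAE :", mae(reference, restored))
print("PSNR:", psnr(reference, restored), "dB")
```

Higher PSNR and lower MAE both indicate a restored image that is closer to the reference, which is why the abstract reports PSNR as an improvement and MAE as a reduction.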

MAT: Mask-Aware Transformer for Large Hole Image Inpainting

Deep Transformer Based Video Inpainting Using Fast Fourier Tokenization

HINT: High-quality INpainting Transformer with Mask-Aware Encoding and Enhanced Attention

Delving Globally into Texture and Structure for Image Inpainting

Bridging partial-gated convolution with transformer for smooth-variation image inpainting

DeViT: Deformed Vision Transformers in Video Inpainting

Image Inpainting Technique Incorporating Edge Prior and Attention Mechanism

Image Inpainting by End-to-End Cascaded Refinement With Mask Awareness

Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding

NLKFill: high-resolution image inpainting with a novel large kernel attention

MxT: Mamba x Transformer for Image Inpainting

Sparse self-attention transformer for image inpainting

PIPformers: Patch based inpainting with vision transformers for generalize paintings

Learning Joint Spatial-Temporal Transformations for Video Inpainting

DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion

Effective Local-Global Transformer for Natural Image Matting

ITrans: generative image inpainting with transformers

TransMatting: Enhancing Transparent Objects Matting with Transformers

WaveFormer: Wavelet Transformer for Noise-Robust Video Inpainting

Inpainting Transformer for Anomaly Detection