ITrans: generative image inpainting with transformers

Wei Miao,Lijun Wang,Huchuan Lu,Kaining Huang,Xinchu Shi,Bocong Liu
DOI: https://doi.org/10.1007/s00530-023-01211-w
IF: 3.9
2024-01-19
Multimedia Systems
Abstract: Despite significant improvements, convolutional neural network (CNN) based methods struggle to handle long-range global image dependencies due to their limited receptive fields, leading to unsatisfactory inpainting performance in complicated scenarios. To address this issue, we propose the Inpainting Transformer (ITrans) network, which combines the power of both self-attention and convolution operations. The ITrans network augments the convolutional encoder–decoder structure with two novel designs, i.e., the global and local transformers. The global transformer aggregates high-level image context from the encoder from a global perspective, and propagates the encoded global representation to the decoder in a multi-scale manner. Meanwhile, the local transformer is intended to extract low-level image details inside the local neighborhood at a reduced computational overhead. By incorporating these two transformers, ITrans is capable of both global relationship modeling and local detail encoding, which is essential for hallucinating perceptually realistic images. Extensive experiments demonstrate that the proposed ITrans network performs favorably against state-of-the-art inpainting methods both quantitatively and qualitatively.
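The contrast the abstract draws between global and local attention can be illustrated with a toy example. The sketch below is not the ITrans implementation: it assumes a 1-D token sequence, a single attention head, no learned projections (each token serves as its own query, key, and value), and a hypothetical `radius` parameter that restricts each token to a local neighborhood. It only shows why restricting attention to a neighborhood reduces the number of attention scores that must be computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, radius=None):
    """Toy single-head dot-product self-attention (illustrative only).

    tokens: list of equal-length float vectors, used directly as
            queries, keys, and values (an assumption of this sketch).
    radius: None for global attention; otherwise each token attends
            only to tokens at most `radius` positions away.
    Returns (outputs, num_scores), where num_scores counts the
    query-key pairs that were evaluated.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    num_scores = 0
    for i, query in enumerate(tokens):
        if radius is None:
            neighbors = list(range(len(tokens)))
        else:
            lo = max(0, i - radius)
            hi = min(len(tokens), i + radius + 1)
            neighbors = list(range(lo, hi))
        scores = [
            scale * sum(q * k for q, k in zip(query, tokens[j]))
            for j in neighbors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the attended tokens.
        output = [
            sum(w * tokens[j][d] for w, j in zip(weights, neighbors))
            for d in range(dim)
        ]
        outputs.append(output)
        num_scores += len(neighbors)
    return outputs, num_scores


tokens = [[float(i), 1.0] for i in range(8)]
_, global_scores = self_attention(tokens)           # every token sees all 8
_, local_scores = self_attention(tokens, radius=1)  # at most 3 neighbors each
print(global_scores, local_scores)  # 64 22
```

Global attention grows quadratically with the number of tokens, while the local variant grows linearly for a fixed neighborhood size, which is the trade-off behind pairing a global module on compact high-level features with a local module on detailed low-level features.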
computer science, information systems, theory & methods
What problem does this paper attempt to address?
This paper addresses a weakness of conventional convolutional neural network (CNN) methods in image inpainting: because of their limited receptive fields, they struggle to handle long-range global image dependencies, which results in poor inpainting performance in complex scenes. Specifically, the paper points out:

- **Problem Background**: Image inpainting (or image completion) is the task of filling in missing pixels to produce a complete image. It has applications across image editing, such as object removal, image restoration, and photo retouching. Before the deep-learning era, such tasks were mainly carried out by using existing image patches to fill in occluded areas. However, these methods lack semantic understanding and have therefore been replaced by methods based on deep neural networks.
- **Limitations of Existing Methods**: Although CNN-based methods perform well at generating details, their limited receptive fields cannot capture the information required for high-quality inpainting, especially in complex scenes, which leads to unwanted artifacts and blurry results.
- **New Challenges**: Recently, Transformer models have demonstrated record-breaking performance in various computer vision tasks, especially in modeling long-range dependencies. However, Transformers lack inductive bias, which poses challenges when they process images. Although Transformers have a higher performance ceiling than CNNs, they are harder to train because of complex pre-training requirements.

To solve these problems, the authors propose the Inpainting Transformer (ITrans) network, which aims to combine the advantages of CNNs and Transformers to improve the quality of image inpainting. Specifically, the ITrans network introduces global Transformer and local Transformer modules to strengthen its modeling of global relationships and local details, allowing it to generate more perceptually realistic images.
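The receptive-field limitation of CNNs mentioned above can be made concrete with the standard recurrence for stacked convolutions: each layer widens the receptive field by `(kernel - 1) * dilation` times the product of all preceding strides. The following is a minimal sketch; the layer configurations are hypothetical examples chosen for illustration and are not taken from the ITrans encoder.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a conv stack.

    layers: list of (kernel_size, stride, dilation) tuples, ordered
            from the input to the output of the stack.
    """
    field = 1  # a single output unit initially "sees" one input pixel
    jump = 1   # distance in input pixels between adjacent units
    for kernel, stride, dilation in layers:
        field += (kernel - 1) * dilation * jump
        jump *= stride
    return field


# Hypothetical stacks, for illustration only.
plain = [(3, 1, 1)] * 3                       # three 3x3 convs, stride 1
strided = [(3, 2, 1)] * 3                     # three 3x3 convs, stride 2
dilated = [(3, 1, 2), (3, 1, 4), (3, 1, 8)]   # dilations 2, 4, 8

print(receptive_field(plain))    # 7
print(receptive_field(strided))  # 15
print(receptive_field(dilated))  # 29
```

Even with striding or dilation, a few convolutional layers cover only tens of pixels per axis, so a large missing region in a high-resolution image may lie mostly outside the view of any single unit. A global self-attention layer, by contrast, relates every position to every other position in one step.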