TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting

Taorong Liu, Liang Liao, Delin Chen, Jing Xiao, Zheng Wang, Chia-Wen Lin, Shin'ichi Satoh
2024-10-03
Abstract: Image inpainting for completing complicated semantic environments and diverse hole patterns of corrupted images is challenging even for state-of-the-art learning-based inpainting methods trained on large-scale data. A reference image capturing the same scene of a corrupted image offers informative guidance for completing the corrupted image as it shares similar texture and structure priors to that of the holes of the corrupted image. In this work, we propose a transformer-based encoder-decoder network, named TransRef, for reference-guided image inpainting. Specifically, the guidance is conducted progressively through a reference embedding procedure, in which the referencing features are subsequently aligned and fused with the features of the corrupted image. For precise utilization of the reference features for guidance, a reference-patch alignment (Ref-PA) module is proposed to align the patch features of the reference and corrupted images and harmonize their style differences, while a reference-patch transformer (Ref-PT) module is proposed to refine the embedded reference feature. Moreover, to facilitate the research of reference-guided image restoration tasks, we construct a publicly accessible benchmark dataset containing 50K pairs of input and reference images. Both quantitative and qualitative evaluations demonstrate the efficacy of the reference information and the proposed method over the state-of-the-art methods in completing complex holes. Code and dataset can be accessed at https://github.com/Cameltr/TransRef.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses image inpainting in complex scenarios, in particular corrupted images with complicated semantic environments and diverse hole patterns. Even state-of-the-art learning-based inpainting methods trained on large-scale data struggle to cope with this challenge. Specifically, the authors point out that when the corrupted region spans multiple semantics, it is very difficult to map the different semantics onto a single context manifold using context information alone, which usually results in blurred boundaries and incorrect semantic content. To overcome these problems, the authors propose a reference-guided image inpainting method: a reference image that contains similar structures and details, but may differ in style and geometry, is introduced to help restore the corrupted scene.

### Core problems and solutions in the paper

1. **Limitations of existing methods**:
   - Existing learning-based inpainting methods perform poorly in complex scenarios, especially when the corrupted region is large or semantically complex.
   - Convolutional neural networks (CNNs) have limited receptive fields and therefore have difficulty capturing long-range dependencies, which degrades the inpainting results.
2. **Advantages of introducing reference images**:
   - A reference image provides texture and structure priors similar to those of the corrupted image and can guide the inpainting process more accurately.
   - With the rich texture and structure information of the reference image, the corrupted region can be restored more realistically, avoiding blurry or implausible results.
3. **Proposed solutions**:
   - A transformer-based encoder-decoder network, named TransRef, is proposed for reference-guided image inpainting.
   - Two key modules are introduced, the reference-patch alignment (Ref-PA) module and the reference-patch transformer (Ref-PT) module, to achieve local-to-global feature alignment and refinement.
   - A public benchmark dataset, DPED50K, is constructed. It contains 50,000 pairs of input and reference images and is intended to promote research on reference-guided inpainting tasks.

### Formula representation

The formulas involved in the description are as follows:

- Definition of the corrupted image \(I_m\):
  \[ I_m = I \odot (1 - M) \]
  where \(I\) is the original image, \(M\) is a binary mask (0 marks known pixels, 1 marks missing pixels), and \(\odot\) denotes element-wise multiplication.
- Encoder-decoder architecture without a reference:
  \[ z = Enc(I_m, M; \theta_{enc}) \]
  \[ \hat{I} = Dec(z; \theta_{dec}) \]
  where \(z\) is the latent feature produced by the encoder \(Enc\), \(\hat{I}\) is the inpainted output of the decoder \(Dec\), and \(\theta_{enc}\) and \(\theta_{dec}\) are the learnable parameters of the encoder and decoder, respectively.
- Encoder-decoder architecture after introducing the reference image:
  \[ z = Enc(I_m, M, I_{ref}; \theta_{enc}) \]
  \[ \hat{I} = Dec(z; \theta_{dec}) \]
  where \(I_{ref}\) is the reference image, which contains textures and content similar to the original image \(I\), provides scene priors, and compensates for the loss of texture details and structures.

With these components, TransRef is reported to handle image inpainting in complex scenarios more effectively and to produce more realistic and accurate results.
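The definition of the corrupted image, \(I_m = I \odot (1 - M)\), can be illustrated with a minimal NumPy sketch. The function name `corrupt` and the toy arrays are assumptions made for illustration and are not taken from the authors' code. The sketch only shows the mask convention stated above (1 marks missing pixels, 0 marks known pixels).

```python
import numpy as np


def corrupt(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Return I_m = I * (1 - M), where mask is 1 for missing and 0 for known pixels."""
    if mask.ndim == image.ndim - 1:
        # Add a trailing axis so an H x W mask broadcasts over the channel axis.
        mask = mask[..., None]
    return image * (1 - mask)


# Toy 2 x 2 RGB image (hypothetical values) with one missing pixel at row 0, column 1.
image = np.arange(1, 13, dtype=np.float32).reshape(2, 2, 3)
mask = np.array([[0, 1], [0, 0]], dtype=np.float32)
corrupted = corrupt(image, mask)
```

The missing pixel is zeroed in every channel, while known pixels keep their original values. A reference-guided model would receive `corrupted`, `mask`, and a reference image as inputs.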