Abstract: We present FaithFill, a diffusion-based inpainting object completion approach for realistic generation of missing object parts. Typically, multiple reference images are needed to achieve such realistic generation; otherwise, the generation would not faithfully preserve shape, texture, color, and background. In this work, we propose a pipeline that utilizes only a single input reference image, which may have varying lighting, background, object pose, and/or viewpoint. The single reference image is used to generate multiple views of the object to be inpainted. We demonstrate that FaithFill produces faithful generation of the object's missing parts, together with background/scene preservation, from a single reference image. This is demonstrated through standard similarity metrics, human judgement, and GPT evaluation. Our results are presented on the DreamBooth dataset and on a newly proposed dataset.
What problem does this paper attempt to address?
The paper addresses the following problem: how to realistically inpaint the missing parts of an object from a single reference image while keeping the object's shape, texture, and color, as well as the background, consistent with the original. Existing methods usually need multiple reference images to produce realistic results; without them, the output may not faithfully preserve these properties. The paper proposes FaithFill, a method that aims for high-quality object completion faithful to the original image using only one reference image.
### Specific Problem Description
1. **Limitations of Existing Methods**:
- Most existing image inpainting methods rely on multiple reference images to generate realistic results, which is not always feasible in practical applications.
- Methods that use a single reference image may produce results that are not faithful to the shape, texture, color, or background of the original image.
2. **Research Objectives**:
- Propose FaithFill, a diffusion-based image inpainting method that achieves high-quality inpainting of the missing parts of an object using only one reference image.
- Ensure that the generated results are not only realistic but also faithfully preserve the features of the object and the background.
3. **Key Challenges**:
- How to extract sufficient information from a single reference image to generate object views from multiple perspectives.
- How to maintain the consistency of features such as the shape, texture, and color of the object and the background during the inpainting process.
### Solution Overview
To address the above challenges, FaithFill combines the following components:
- **Multi-Perspective Generation Module**: Use a NeRF (Neural Radiance Field) model to generate views of the object from multiple perspectives, starting from a single reference image, thereby providing additional viewpoint information.
- **Segmentation Module**: Use the Segment Anything Model (SAM) to extract the object of interest from the reference image and remove the background to ensure natural fusion.
- **Inpainting Module**: Combine the CLIP text encoder and the ControlNet adapter, and perform inpainting through the U-Net denoiser so that the inpainted area stays consistent with the original image.
- **Low-Rank Adaptation (LoRA)**: Fine-tune the U-Net and the CLIP text encoder with LoRA, reducing the computational cost and improving the generalization ability of the model.
Working together, these modules allow FaithFill to generate high-quality inpainting results that are faithful to the original image using only one reference image.
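
To make the LoRA component more concrete, the following is a minimal sketch of the general low-rank adaptation idea, not FaithFill's actual implementation. It assumes a single linear layer whose pretrained weight `W` stays frozen while a trainable update `(alpha / r) * B @ A` is added to it. The class name `LoRALinear`, the helper `matmul`, and the toy values are hypothetical and chosen only for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, shape (d_out, d_in)
        self.scale = alpha / rank
        # A starts small and random, B starts at zero,
        # so the adapted layer initially matches the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        """Apply (W + scale * B A) to a vector x given as a flat list."""
        col = [[v] for v in x]
        base = matmul(self.weight, col)
        update = matmul(self.B, matmul(self.A, col))
        return [b[0] + self.scale * u[0] for b, u in zip(base, update)]

    def trainable_params(self):
        """Count the entries of A and B, the only parameters that would be trained."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


# Toy 4x4 "pretrained" weight (illustrative values only).
weight = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
layer = LoRALinear(weight, rank=1)
x = [1.0, 0.0, -1.0, 2.0]
y = layer.forward(x)
full_params = len(weight) * len(weight[0])
print(y, layer.trainable_params(), "trainable vs", full_params, "in the full matrix")
```

For a layer of shape `d_out x d_in`, the update trains `r * (d_in + d_out)` values instead of `d_in * d_out`, which is where the reduced fine-tuning cost comes from when `r` is small relative to the layer dimensions.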
### Evaluation and Verification
The paper evaluates and verifies FaithFill in the following aspects:
- **Benchmark Datasets**: Conduct experiments on the DreamBooth dataset and the self-built FaithFill dataset.
- **Evaluation Metrics**: Use standard similarity measures (such as SSIM, PSNR, and LPIPS), human judgment, and GPT evaluation for quantitative and qualitative evaluation.
- **User Studies**: Recruit participants through the Amazon Mechanical Turk platform to conduct large-scale human judgment experiments.
- **Comparative Experiments**: Compare with a variety of state-of-the-art methods (such as RePaint, GLIDE, Blended Latent Diffusion, Stable Inpainting, Paint-By-Example, and LeftRefill) to verify the advantages of FaithFill.
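
As an illustration of one of the similarity measures listed above, here is a minimal sketch of PSNR using its standard definition, `10 * log10(MAX^2 / MSE)`. It is not the paper's evaluation code. It assumes grayscale images given as nested lists with pixel values in `[0, 255]`; the function names `mse` and `psnr` and the tiny example images are hypothetical. SSIM and LPIPS are more involved and are not covered here.

```python
import math


def mse(img_a, img_b):
    """Mean squared error between two images given as lists of pixel rows."""
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)


def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means the images are closer."""
    err = mse(img_a, img_b)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# Tiny 2x2 example: a ground-truth image and a slightly different inpainted result.
reference = [[0, 255], [128, 64]]
inpainted = [[0, 250], [130, 64]]
print(f"MSE:  {mse(reference, inpainted):.2f}")
print(f"PSNR: {psnr(reference, inpainted):.2f} dB")
```

PSNR only measures pixel-wise error, which is presumably one reason evaluations like this one also report perceptual measures such as LPIPS and human judgment.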
In summary, the main contribution of this paper is FaithFill, an image inpainting method that requires only a single reference image and generates high-quality inpainting results while keeping the object and background features consistent.