Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance

Yulin Pan, Chaojie Mao, Zeyinzi Jiang, Zhen Han, Jingfeng Zhang
2024-03-29
Abstract: Prior studies have made significant progress in image inpainting guided by either text or subject image. However, the research on editing with their combined guidance is still in the early stages. To tackle this challenge, we present LAR-Gen, a novel approach for image inpainting that enables seamless inpainting of masked scene images, incorporating both the textual prompts and specified subjects. Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence. The process involves (i) Locate: concatenating the noise with masked scene image to achieve precise regional editing, (ii) Assign: employing decoupled cross-attention mechanism to accommodate multi-modal guidance, and (iii) Refine: using a novel RefineNet to supplement subject details. Additionally, to address the issue of scarce training data, we introduce a novel data construction pipeline. This pipeline extracts substantial pairs of data consisting of local text prompts and corresponding visual instances from a vast image dataset, leveraging publicly available large models. Extensive experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency. Project page can be found at \url{
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a new problem in image inpainting: customized inpainting guided jointly by a text description and a specific object image. Existing inpainting methods rely on either text descriptions or object images alone, and research on using both at once is still in its infancy. To tackle this, the authors propose LAR-Gen, a framework that can seamlessly integrate any customized object into a specified location within a scene image and lets users finely control the result through text prompts. LAR-Gen is built on three core mechanisms:

1. **Locate**: Concatenates the noise with the masked scene image and the mask image so that the model inpaints only the specified region and keeps the background unchanged.
2. **Assign**: Employs a decoupled cross-attention mechanism to handle multi-modal guidance (text and object image), so that the inpainted content matches the semantics of the local description and the coarse-grained object reference.
3. **Refine**: Introduces a novel RefineNet that supplements object details, so that the inpainted object stays faithful to the reference.

To overcome the scarcity of training data, the authors also propose a data construction pipeline that automatically extracts the required quadruples (scene image, scene mask, object image, and text prompt) from large-scale image datasets. Experimental results show that LAR-Gen excels at preserving both object identity and text semantic consistency, and that it serves as a unified framework supporting both text-guided and object image-guided inpainting.
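
To make the Locate and Assign steps more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that Locate is a channel-wise concatenation of a 4-channel noisy latent, a 4-channel masked-scene latent, and a 1-channel mask, and that the decoupled cross-attention shares one query projection while giving text and image features their own key/value projections, with the two outputs summed. The function names, tensor shapes, weight dictionary keys, and the `image_scale` parameter are all hypothetical and chosen only for illustration.

```python
import numpy as np

def locate_input(noisy_latent, masked_scene_latent, mask):
    """Locate (assumed form): stack inputs along the channel axis -> (C + C + 1, H, W)."""
    return np.concatenate([noisy_latent, masked_scene_latent, mask], axis=0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def decoupled_cross_attention(x, text_tokens, image_tokens, w, image_scale=1.0):
    """Assign (assumed form): one shared query, separate K/V per modality, outputs summed."""
    q = x @ w["q"]
    text_out = attention(q, text_tokens @ w["k_text"], text_tokens @ w["v_text"])
    image_out = attention(q, image_tokens @ w["k_image"], image_tokens @ w["v_image"])
    # image_scale (hypothetical) weights the subject-image branch against the text branch.
    return text_out + image_scale * image_out

rng = np.random.default_rng(0)

# Locate: 4 + 4 + 1 = 9 input channels on an 8x8 latent grid (illustrative sizes).
unet_input = locate_input(
    rng.normal(size=(4, 8, 8)),   # noisy latent
    rng.normal(size=(4, 8, 8)),   # latent of the masked scene image
    np.ones((1, 8, 8)),           # binary mask, here all ones
)

# Assign: 64 spatial queries, 5 text tokens, 3 subject-image tokens, width 16.
names = ("q", "k_text", "v_text", "k_image", "v_image")
weights = {name: rng.normal(size=(16, 16)) for name in names}
out = decoupled_cross_attention(
    rng.normal(size=(64, 16)),
    rng.normal(size=(5, 16)),
    rng.normal(size=(3, 16)),
    weights,
)

print(unet_input.shape, out.shape)  # (9, 8, 8) (64, 16)
```

In this sketch, setting `image_scale=0.0` reduces the layer to ordinary text cross-attention, which is one way a single model could fall back to text-only guidance. Whether LAR-Gen exposes such a knob is not stated in the summary above.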