Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

Haipeng Liu, Yang Wang, Biao Qian, Meng Wang, Yong Rui
2024-04-01
Abstract: Denoising diffusion probabilistic models for image inpainting add noise to the texture of an image during the forward process and recover the masked regions from the unmasked texture via the reverse denoising process. Although they generate meaningful semantics, existing methods suffer from a semantic discrepancy between masked and unmasked regions: the semantically dense unmasked texture fails to be completely degraded, while the masked regions turn to pure noise during the diffusion process. In this paper, we ask how the unmasked semantics guide the texture denoising process and how to tackle the semantic discrepancy, so as to facilitate consistent and meaningful semantics generation. To this end, we propose a novel structure-guided diffusion model, named StrDiffusion, which reformulates the conventional texture denoising process under structure guidance to derive a simplified denoising objective for image inpainting, while revealing that: 1) the semantically sparse structure helps tackle the semantic discrepancy in the early stage, while the dense texture generates reasonable semantics in the late stage; 2) the semantics from the unmasked regions essentially offer time-dependent structure guidance for the texture denoising process, benefiting from the time-dependent sparsity of the structure semantics. For the denoising process, a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. In addition, we devise an adaptive resampling strategy as a formal criterion for whether the structure is competent to guide the texture denoising process, while regulating their semantic correlation. Extensive experiments validate the merits of StrDiffusion over state-of-the-art methods. Our code is available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper investigates semantic inconsistency in diffusion-based image inpainting. Existing methods such as IR-SDE use the semantics of the unmasked regions to guide texture denoising, but the unmasked texture is semantically dense and is not fully degraded by the diffusion process, whereas the masked regions degrade to pure noise. The resulting semantic discrepancy between the two limits restoration quality, especially the semantic consistency between the repaired area and the unmasked regions.
To address this, the paper introduces StrDiffusion, a structure-guided diffusion model that reformulates the traditional texture denoising process and derives a simplified denoising objective, aiming to generate semantics that are both consistent and meaningful. Its core idea is to use structure (such as edges or grayscale images) as an auxiliary guide: the semantic sparsity of the structure helps maintain semantic consistency in the early stages of denoising, while the dense texture generates reasonable semantics in the later stages. In this way, StrDiffusion balances semantic consistency and plausibility throughout the denoising process.
The paper also designs an adaptive resampling strategy that monitors the semantic correlation between structure and texture and adjusts it as needed to strengthen the guiding role of the structure. According to the reported experiments, StrDiffusion maintains semantic consistency better than existing techniques on typical datasets. In summary, by combining a structure-guided diffusion model with an adaptive resampling strategy, the paper aims to resolve semantic inconsistency in diffusion-based image inpainting and achieve higher-quality restoration.
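The control flow of an adaptive resampling check can be sketched as a toy loop. This is not the authors' implementation: the function names (`adaptive_resample`, `toy_denoise_step`, `toy_correlation_score`), the threshold, the resampling cap, and the list-of-floats "images" are all assumptions made for illustration. The sketch only shows the idea of re-noising and re-denoising the texture while a structure-texture score stays below a threshold.

```python
import random


def adaptive_resample(texture, structure, denoise_step, correlation_score,
                      threshold=0.5, max_resamples=3, noise_scale=0.1, seed=0):
    """Hypothetical loop: repeat a denoising step while the structure-texture
    score is below `threshold`, up to `max_resamples` extra attempts."""
    rng = random.Random(seed)
    candidate = denoise_step(texture, structure)
    resamples = 0
    while (correlation_score(candidate, structure) < threshold
           and resamples < max_resamples):
        # Re-noise the current estimate, then denoise it again under guidance.
        noisy = [value + rng.gauss(0.0, noise_scale) for value in candidate]
        candidate = denoise_step(noisy, structure)
        resamples += 1
    return candidate, resamples


def toy_denoise_step(texture, structure):
    """Stand-in for a guided denoiser: move each value halfway to the structure."""
    return [t + 0.5 * (s - t) for t, s in zip(texture, structure)]


def toy_correlation_score(texture, structure):
    """Stand-in score: 1 minus the mean absolute difference (at most 1.0)."""
    mean_abs_diff = sum(abs(t - s) for t, s in zip(texture, structure)) / len(texture)
    return 1.0 - mean_abs_diff


texture = [0.0, 1.0, 0.2]
structure = [0.5, 0.5, 0.5]

# Score after the first step is above the default threshold, so no resampling.
accepted, n_ok = adaptive_resample(
    texture, structure, toy_denoise_step, toy_correlation_score)

# An unreachable threshold forces the loop to stop at the resampling cap.
capped, n_capped = adaptive_resample(
    texture, structure, toy_denoise_step, toy_correlation_score, threshold=2.0)

print(n_ok, n_capped)
```

In a real model the score would come from a learned measure of semantic correlation rather than a pixel difference, and the re-noising would follow the diffusion schedule. Both are replaced by stand-ins here.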