Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt

Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li
2024-04-08
Abstract: Text-to-image generation has witnessed great progress, especially with the recent advancements in diffusion models. Since texts cannot provide detailed conditions like object appearance, reference images are usually leveraged for the control of objects in the generated images. However, existing methods still suffer from limited accuracy when the relationship between the foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet by introducing an additional mask prompt. Specifically, we first employ large vision models to obtain masks to segment the objects of interest in the reference image. Then, the object images are employed as additional prompts to help the diffusion model better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts enhance the controllability of the diffusion model to maintain higher fidelity to the reference image while achieving better image quality. Comparison with previous text-to-image generation methods demonstrates our method's superior quantitative and qualitative performance on the benchmark datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses limitations of reference-guided text-to-image generation. Although existing methods have made significant progress, especially with the help of diffusion models, they still struggle in complex scenes, producing object distortion, background overfitting, and disharmony between foreground and background. To solve these problems, the authors propose a framework called Mask-ControlNet, which introduces additional mask prompts to better control the relationship between foreground and background during image generation. The approach is as follows:

1. **Utilize large vision models to obtain masks**: A powerful vision model (such as SAM) segments the reference image to obtain the mask of the object of interest.
2. **Separate foreground and background information**: The mask is used to cut the foreground object out of the reference image, and the resulting foreground image is passed to the model as an additional prompt.
3. **Enhance the diffusion model's understanding**: This additional information helps the diffusion model better understand the relationship between foreground and background, so the generated image stays more faithful to the reference.

Experimental results show that Mask-ControlNet generates higher-quality images that better match expectations, with the largest gains in complex scenes. Compared with existing methods, it performs better in both quantitative and qualitative evaluations. User studies also support its advantage in image realism, aesthetic quality, and accuracy.
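The foreground-separation step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary object mask is already available (for example, from a segmentation model such as SAM, which is not called here), and the function name `extract_foreground`, the `fill` parameter, and the toy arrays are hypothetical names chosen for illustration.

```python
import numpy as np


def extract_foreground(image: np.ndarray, mask: np.ndarray, fill: int = 0) -> np.ndarray:
    """Keep the pixels where `mask` is set and replace the rest with `fill`.

    image: H x W x C array (the reference image).
    mask:  H x W array, nonzero/True on the object of interest.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image's height and width")
    # Add a trailing axis so the mask broadcasts over the channel dimension.
    keep = mask.astype(bool)[..., None]
    return np.where(keep, image, np.asarray(fill, dtype=image.dtype))


# Toy data (assumption): a uniform 4x4 RGB image and a 2x2 object mask.
reference = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

# The masked object image is what would serve as the additional prompt.
object_image = extract_foreground(reference, mask)
```

Here the background is simply filled with a constant value; how the paper encodes the object image and feeds it to the diffusion model is not specified in this summary, so that part is left out.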