Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt

Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li

2024-04-08

Abstract: Text-to-image generation has witnessed great progress, especially with the recent advancements in diffusion models. Since text cannot provide detailed conditions such as object appearance, reference images are usually leveraged to control the objects in the generated images. However, existing methods still suffer from limited accuracy when the relationship between the foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet by introducing an additional mask prompt. Specifically, we first employ large vision models to obtain masks that segment the objects of interest in the reference image. Then, the object images are employed as additional prompts to help the diffusion model better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts enhance the controllability of the diffusion model, maintaining higher fidelity to the reference image while achieving better image quality. Comparison with previous text-to-image generation methods demonstrates our method's superior quantitative and qualitative performance on the benchmark datasets.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses limitations of reference-guided text-to-image generation. Although existing methods have made significant progress, especially with the help of diffusion models, they still struggle with complex scenes, where they produce object distortion, background overfitting, and disharmony between foreground and background. To solve these problems, the authors propose a framework called Mask-ControlNet, which introduces an additional mask prompt to better control the relationship between foreground and background during image generation. The approach is as follows:

1. **Utilize large vision models to obtain masks**: A powerful vision model (such as SAM) segments the reference image to obtain the mask of the object of interest.
2. **Separate foreground and background information**: The mask is used to cut the foreground object out of the reference image, and the resulting foreground image is fed to the model as an additional prompt.
3. **Enhance the understanding ability of the diffusion model**: This additional input helps the diffusion model better understand the relationship between the foreground and background, so the generated images stay more faithful to the reference image.

Experimental results show that Mask-ControlNet generates higher-quality images that better match the intended result, with the clearest gains in complex scenes. Compared with existing methods, it achieves better performance in both quantitative and qualitative evaluations. User studies also favor Mask-ControlNet in terms of image realism, aesthetic quality, and accuracy.
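The foreground-separation step can be illustrated with a minimal sketch. It assumes the segmentation model has already produced a boolean mask of shape (H, W) for an (H, W, C) reference image; the function name `extract_foreground`, the `fill_value` parameter, and the choice to fill the background with zeros are hypothetical and not taken from the paper, which may handle the background differently.

```python
import numpy as np


def extract_foreground(image: np.ndarray, mask: np.ndarray, fill_value: int = 0) -> np.ndarray:
    """Keep pixels where the mask is True and fill the rest with fill_value.

    Assumption: the background is blanked with a constant; the paper's
    actual preprocessing is not specified here.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image's height and width")
    mask = mask.astype(bool)
    # Broadcast the (H, W) mask over the channel axis of the (H, W, C) image.
    return np.where(mask[..., None], image, fill_value).astype(image.dtype)


# Toy 2x2 RGB "reference image" and a mask selecting only the top-left pixel.
reference = np.full((2, 2, 3), 200, dtype=np.uint8)
object_mask = np.array([[True, False], [False, False]])
foreground = extract_foreground(reference, object_mask)
```

In a pipeline like the one summarized above, an image such as `foreground` would be the additional prompt passed to the diffusion model alongside the text; how it is encoded and injected is not covered by this sketch.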

ECNet: Effective Controllable Text-to-Image Diffusion Models

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion

PainterNet: Adaptive Image Inpainting with Actual-Token Attention and Diverse Mask Control

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion

ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

Masked Diffusion Models Are Fast Distribution Learners

Local Conditional Controlling for Text-to-Image Diffusion Models

SPDiffusion: Semantic Protection Diffusion for Multi-concept Text-to-image Generation

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions

E-Commerce Inpainting with Mask Guidance in Controlnet for Reducing Overcompletion

Counting Guidance for High Fidelity Text-to-Image Synthesis

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

PromptFix: You Prompt and We Fix the Photo

From Text to Pose to Image: Improving Diffusion Model Control and Quality