Abstract: Diffusion models have enabled the generation of high-quality images with a strong focus on realism and textual fidelity. Yet, large-scale text-to-image models, such as Stable Diffusion, struggle to generate images where foreground objects are placed over a chroma key background, limiting their ability to separate foreground and background elements without fine-tuning. To address this limitation, we present a novel Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM), which optimizes the initial random noise to produce images with foreground objects on a specifiable color background. Our proposed method is the first to explore the manipulation of the color aspects in initial noise for controlled background generation, enabling precise separation of foreground and background without fine-tuning. Extensive experiments demonstrate that our training-free method outperforms existing methods in both qualitative and quantitative evaluations, matching or surpassing fine-tuned models. Finally, we successfully extend it to other tasks (e.g., consistency models and text-to-video), highlighting its transformative potential across various generative applications where independent control of foreground and background is crucial.

What problem does this paper attempt to address?

The paper addresses the difficulty that existing large-scale text-to-image generation models (such as Stable Diffusion) have in generating images with chroma key backgrounds. Specifically, without fine-tuning these models struggle to separate foreground objects from background elements and cannot generate high-quality images with a background of a specified color. This limits their use in applications such as advertising, design, and game development, which require precise control of the foreground and background.

To solve this problem, the authors propose a new method named **Training-Free Chroma Key Content Generation Diffusion Model (TKG-DM)**. The method optimizes the initial random noise so that the generated image shows the foreground objects on a background of a specified color, without any fine-tuning. Its main features are:

1. **No Fine-Tuning Required**: TKG-DM does not rely on additional datasets or a fine-tuning process, which reduces computational cost and resource requirements.
2. **Precise Control of Background Color**: Manipulating the color aspect of the initial noise produces an image with a background of a specific color and a precise separation of foreground and background.
3. **High Flexibility**: Users can control the background color as well as the layout, size, and number of foreground objects.
4. **Strong Scalability**: The method applies not only to text-to-image generation but also extends to conditional text-to-image generation, consistency models, and text-to-video generation.

### Method Overview

The core idea of TKG-DM is to manipulate the initial noise through **Channel Mean Shift** to control the background color of the generated image. The specific steps are as follows:

1. **Channel Mean Shift**:
   - Calculate the initial positive proportion of each channel \( c \) in the initial noise \( z_T\in\mathbb{R}^{h\times w\times4} \), where \( \text{TotalPixels}_c \) is the number of entries in channel \( c \):
     \[ \text{InitialRatio}_c=\frac{\sum_{i,j}1(z_T^{(c)}(i,j)>0)}{\text{TotalPixels}_c} \]
   - Given the target shift \( \text{TargetShift}_c \), calculate the target positive proportion:
     \[ \text{TargetRatio}_c = \text{InitialRatio}_c+\text{TargetShift}_c \]
   - Iteratively adjust the mean shift \( \Delta_c \) of each channel until the positive proportion reaches the target value. This yields the initialized color noise \( z_T^* = F_c(z_T)=z_T+\Delta_c \).
2. **Initial Noise Selection**:
   - Use a two-dimensional Gaussian mask \( A(i,j) \) to combine the initial noise \( z_T \) and the initialized color noise \( z_T^* \) into the noise \( z_T^{\text{key}} \), from which the chroma-key image with a foreground, \( x_0^{\text{key}} \), is generated:
     \[ z_T^{\text{key}}(i,j)=A(i,j)\cdot z_T(i,j)+(1 - A(i,j))\cdot z_T^*(i,j) \]
   - Here, the Gaussian mask parameters \( (\mu_i,\mu_j,\sigma) \) control the position and size of the foreground.

### Experimental Results

Experiments show that TKG-DM outperforms existing methods in both quantitative and qualitative evaluations, especially in generating high-precision chroma-key background images. In addition, the method can match or even surpass fine-tuned models at a lower computational cost.

### Application Expansion

TKG-DM also extends to other tasks, including conditional text-to-image generation, consistency models, and text-to-video generation.
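
The two steps in the Method Overview (Channel Mean Shift followed by Initial Noise Selection) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: the fixed-step search for \( \Delta_c \) (the summary only says the shift is adjusted iteratively), the unnormalized Gaussian form of \( A(i,j) \) with peak value 1, the example `target_shift` values, and all function names.

```python
import numpy as np


def positive_ratio(channel):
    """Fraction of entries in one latent channel that are strictly positive."""
    return float(np.mean(channel > 0))


def channel_mean_shift(z_T, target_shift, step=0.01, max_iters=10_000):
    """Return z_T + Delta_c, with one scalar Delta_c per channel.

    z_T has shape (h, w, 4); target_shift holds one signed value per channel.
    Assumption: Delta_c is found by a simple fixed-step search.
    """
    z_star = z_T.astype(np.float64).copy()
    for c, shift in enumerate(target_shift):
        channel = z_T[..., c]
        # TargetRatio_c = InitialRatio_c + TargetShift_c, kept inside [0, 1].
        target = float(np.clip(positive_ratio(channel) + shift, 0.0, 1.0))
        direction = 1.0 if shift >= 0 else -1.0
        delta = 0.0
        for _ in range(max_iters):
            ratio = positive_ratio(channel + delta)
            reached = ratio >= target if direction > 0 else ratio <= target
            if reached:
                break
            delta += direction * step
        z_star[..., c] = channel + delta
    return z_star


def gaussian_mask(h, w, mu_i, mu_j, sigma):
    """2D mask A(i, j) with value 1 at (mu_i, mu_j), decaying towards 0.

    Assumption: an unnormalized Gaussian; the summary does not give the form.
    """
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.exp(-((i - mu_i) ** 2 + (j - mu_j) ** 2) / (2.0 * sigma ** 2))


def chroma_key_noise(z_T, z_star, mask):
    """z_key = A * z_T + (1 - A) * z_star, with A broadcast over channels."""
    A = mask[..., None]
    return A * z_T + (1.0 - A) * z_star


# Usage with illustrative (hypothetical) values.
rng = np.random.default_rng(0)
z_T = rng.standard_normal((64, 64, 4))
target_shift = [0.0, 0.07, 0.07, 0.0]  # hypothetical per-channel shifts
z_star = channel_mean_shift(z_T, target_shift)
mask = gaussian_mask(64, 64, mu_i=32, mu_j=32, sigma=16)
z_key = chroma_key_noise(z_T, z_star, mask)
```

Near the mask center the blended noise stays close to the original \( z_T \), so the model is free to draw the foreground there. Far from the center it approaches the shifted noise \( z_T^* \), which is what biases the background color. Which channel shifts correspond to which background color depends on the latent space of the specific model and is not covered by this sketch.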

TKG-DM: Training-free Chroma Key Content Generation Diffusion Model

Training-Free Sketch-Guided Diffusion with Latent Optimization

DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

ColorEdit: Training-free Image-Guided Color editing with diffusion model

Test-time Conditional Text-to-Image Synthesis Using Diffusion Models

Pix2Video: Video Editing using Image Diffusion

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

From Text to Pose to Image: Improving Diffusion Model Control and Quality

Video Colorization with Pre-trained Text-to-Image Diffusion Models

Controlled Training Data Generation with Diffusion Models