Frequency-Controlled Diffusion Model for Versatile Text-Guided Image-to-Image Translation

Xiang Gao, Zhengbo Xu, Junhan Zhao, Jiaying Liu
DOI: https://doi.org/10.1609/aaai.v38i3.27951
2024-07-03
Abstract: Recently, large-scale text-to-image (T2I) diffusion models have emerged as a powerful tool for image-to-image translation (I2I), allowing open-domain image translation via user-provided text prompts. This paper proposes frequency-controlled diffusion model (FCDiffusion), an end-to-end diffusion-based framework that contributes a novel solution to text-guided I2I from a frequency-domain perspective. At the heart of our framework is a feature-space frequency-domain filtering module based on Discrete Cosine Transform, which filters the latent features of the source image in the DCT domain, yielding filtered image features bearing different DCT spectral bands as different control signals to the pre-trained Latent Diffusion Model. We reveal that control signals of different DCT spectral bands bridge the source image and the T2I generated image in different correlations (e.g., style, structure, layout, contour, etc.), and thus enable versatile I2I applications emphasizing different I2I correlations, including style-guided content creation, image semantic manipulation, image scene translation, and image style translation. Different from related approaches, FCDiffusion establishes a unified text-guided I2I framework suitable for diverse image translation tasks simply by switching among different frequency control branches at inference time. The effectiveness and superiority of our method for text-guided I2I are demonstrated with extensive experiments both qualitatively and quantitatively. The code is publicly available at: https://github.com/XiangGao1102/FCDiffusion
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses text-guided image-to-image (I2I) translation, in particular how to use text prompts for open-domain image translation. It introduces the Frequency-Controlled Diffusion Model (FCDiffusion), an end-to-end diffusion-based framework that approaches the problem from a frequency-domain perspective.

### Main Contributions and Objectives

- **Problem Addressed**: Traditional I2I methods are limited to specific domains or require paired training data. FCDiffusion aims to overcome these limitations and support open-domain, flexible, and diverse I2I tasks, such as style-guided content creation, image semantic manipulation, image scene translation, and image style translation.
- **Technical Means**: At the core of FCDiffusion is a feature-space frequency-domain filtering module based on the Discrete Cosine Transform (DCT). It filters the latent features of the source image and produces features restricted to different frequency bands, which serve as control signals for the pre-trained Latent Diffusion Model (LDM). Different bands (such as low, mid, and high frequencies) control different correlations between the source image and the generated image (e.g., style, structure, layout, contour), so the model can be adapted to different application scenarios.
- **Advantages**:
  - It adapts to different I2I applications simply by switching among frequency control branches at inference time.
  - It integrates multiple scalable frequency control branches, allowing flexible switching between I2I tasks within a single model.
  - Its learning objective is straightforward, with low computational requirements and fast inference while maintaining high visual quality.

### Method Overview

- **Architecture Components**: FCDiffusion consists of three main parts: the pre-trained LDM, the Frequency Filtering Module (FFM), and the Frequency Control Network (FreqControlNet, FCNet).
- **Frequency Filtering Module (FFM)**: Transforms the latent features of the source image into the frequency domain and extracts features of different frequency bands with the designed DCT filters. These filtered features are used as control signals.
- **Frequency Control Network (FCNet)**: A trainable network that controls the denoising process of the LDM. It takes the control signal, the time step, and the text embedding as input, and outputs multi-scale feature maps that guide the LDM in reconstructing the latent features of the source image.

### Experimental Results

- The paper demonstrates high-quality results for FCDiffusion on various I2I tasks, including style-guided content creation, image semantic manipulation, image scene translation, and image style translation.
- Compared to related methods, FCDiffusion handles challenging I2I examples better, generating results that are consistent with the text prompt while preserving the style and structure of the original image.

### Conclusion

In summary, FCDiffusion addresses open-domain image-to-image translation and produces high-quality results across the application scenarios listed above. Its advantages lie in its flexibility, efficiency, and output quality, which make it a promising I2I solution.
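
The frequency filtering step described in the method overview (transform the latent features with a DCT, keep one spectral band, transform back) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `dct_matrix`, `band_mask`, and `filter_band` are hypothetical, the band rule (selecting coefficients by the index sum `u + v`) is an assumption made for illustration, and the thresholds and the `(4, 64, 64)` latent shape are arbitrary example values rather than settings taken from the paper.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix C, so that C @ x is the 1-D DCT of x."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)     # rescale the DC row so that C is orthogonal
    return c


def band_mask(h: int, w: int, lo: int, hi: int) -> np.ndarray:
    """Binary mask keeping coefficients with lo <= u + v < hi.

    Assumption: bands are defined by the index sum u + v; the paper's
    actual filter design may differ.
    """
    u = np.arange(h)[:, None]
    v = np.arange(w)[None, :]
    s = u + v
    return ((s >= lo) & (s < hi)).astype(np.float64)


def filter_band(latent: np.ndarray, lo: int, hi: int) -> np.ndarray:
    """Restrict a (C, H, W) feature map to one DCT band, channel by channel."""
    _, h, w = latent.shape
    ch, cw = dct_matrix(h), dct_matrix(w)
    spectrum = ch @ latent @ cw.T                  # 2-D DCT of every channel
    spectrum = spectrum * band_mask(h, w, lo, hi)  # zero out other bands
    return ch.T @ spectrum @ cw                    # inverse 2-D DCT


# Example: split a random stand-in latent into three hypothetical bands.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 64, 64))
bands = {"low": (0, 10), "mid": (10, 40), "high": (40, 127)}
controls = {name: filter_band(latent, lo, hi) for name, (lo, hi) in bands.items()}
```

Because the three example masks partition the spectrum, the filtered outputs add back up to the original latent. In the framework described above, each filtered feature map would play the role of the control signal for one frequency control branch, and the chosen branch would be selected at inference time.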