Abstract: Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. In this paper, we present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion (SD) to address challenges such as misinterpreted styles and inconsistent semantics. Our approach introduces two innovative modules: the mixed style descriptor and the dynamic attention adapter. The mixed style descriptor enhances SD by combining content-aware and frequency-disentangled embeddings from CLIP with additional sources that capture global statistics and textual information, thus providing a richer blend of style-related and semantic-related knowledge. To achieve a better balance between adapter capacity and semantic control, the dynamic attention adapter is integrated into the diffusion UNet, dynamically calculating adaptation weights based on the style descriptors. Additionally, we introduce two objective functions to optimize the model alongside the denoising loss, further enhancing semantic and style consistency. Extensive experiments demonstrate the superiority of ArtWeaver over existing methods, producing images with diverse target styles while maintaining the semantic integrity of the text prompts.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve two main challenges in **Stylized Text-to-Image Generation (STIG)**:

1. **Misinterpreted Styles**: Existing methods cannot fully capture the complex artistic styles in the reference images, so the generated images do not match the expected styles.
2. **Inconsistent Semantics**: In existing methods, elements from the reference images unduly affect the output, so the generated images are inconsistent with the content of the text prompts.

To solve these problems, the authors propose a new framework, **ArtWeaver**. ArtWeaver improves the style embedding extraction and injection process by introducing two modules:

- **Mixed Style Descriptor (MSD)**: This module combines content-aware and frequency-disentangled embeddings from CLIP with global statistics and text information, providing a richer representation of style and semantic knowledge.
- **Dynamic Attention Adapter (DAA)**: This module is integrated into the diffusion UNet and dynamically calculates adaptation weights from the style descriptor, to better balance adapter capacity and semantic control.

In addition, the authors introduce two new objective functions, **Gram-consistency loss** and **semantic-disentanglement loss**, to further enhance the style consistency and semantic consistency of the model.

### Formula summary

1. **Gram-consistency loss**:
   \[ L_{\text{style}} = \max\{\delta_p - \delta_n + 0.1,\ 0\} \]
   where
   \[ \delta_p = \sum \left| G(\phi_{\text{vgg}}(\hat{x}_0)) - G(\phi_{\text{vgg}}(I_{\text{pos}})) \right| \]
   \[ \delta_n = \sum \left| G(\phi_{\text{vgg}}(\hat{x}_0)) - G(\phi_{\text{vgg}}(I_{\text{neg}})) \right| \]
2. **Semantic-disentanglement loss**:
   \[ L_{\text{disen}} = \text{sim}(z_{\text{cap}}, z_s) - \delta \cdot \text{sim}(z_{\text{CLIP}}, z_s) \]
   where $\text{sim}$ represents cosine similarity and $\delta$ is a hyperparameter.
3. **Noise-prediction loss** (the loss function of the standard diffusion model):
   \[ L_{\text{noise}} = \mathbb{E}_{x,\ \epsilon\sim\mathcal{N}(0,1),\ t}\left[ \|\epsilon - \epsilon_\theta(z_t, t)\|_2^2 \right] \]

Through these improvements, ArtWeaver can generate high-quality images with diverse target styles while maintaining the semantic integrity of the text prompts.
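To make the first two losses in the formula summary concrete, here is a minimal pure-Python sketch of them as written. It is an illustration under stated assumptions, not the authors' implementation. Feature maps are plain nested lists of shape channels x positions standing in for VGG activations, the Gram matrix is assumed to be the unnormalized product of the feature map with its transpose, and the default `delta=0.5` is an arbitrary placeholder because the summary gives no value. All function names are hypothetical. Only the margin of 0.1 comes from the formula itself.

```python
import math


def gram_matrix(features):
    """Unnormalized Gram matrix of a C x N feature map (assumption: no scaling)."""
    channels = len(features)
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) for j in range(channels)]
        for i in range(channels)
    ]


def gram_distance(feat_a, feat_b):
    """Sum of absolute element-wise differences between two Gram matrices."""
    gram_a, gram_b = gram_matrix(feat_a), gram_matrix(feat_b)
    return sum(
        abs(x - y)
        for row_a, row_b in zip(gram_a, gram_b)
        for x, y in zip(row_a, row_b)
    )


def gram_consistency_loss(feat_pred, feat_pos, feat_neg, margin=0.1):
    """Hinge on delta_p - delta_n + margin, with the 0.1 margin from the summary."""
    delta_p = gram_distance(feat_pred, feat_pos)  # distance to the positive style
    delta_n = gram_distance(feat_pred, feat_neg)  # distance to the negative style
    return max(delta_p - delta_n + margin, 0.0)


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def semantic_disentanglement_loss(z_cap, z_clip, z_s, delta=0.5):
    """sim(z_cap, z_s) - delta * sim(z_clip, z_s); delta=0.5 is a placeholder."""
    return cosine_similarity(z_cap, z_s) - delta * cosine_similarity(z_clip, z_s)


# Toy inputs: 2 channels x 2 positions, chosen only to exercise the functions.
feat_pred = [[1.0, 0.0], [0.0, 1.0]]
feat_same = [[1.0, 0.0], [0.0, 1.0]]
feat_other = [[2.0, 0.0], [0.0, 2.0]]

loss_matching_positive = gram_consistency_loss(feat_pred, feat_same, feat_other)
loss_matching_negative = gram_consistency_loss(feat_pred, feat_other, feat_same)
```

In this toy setup the hinge is zero when the prediction's Gram matrix matches the positive reference and grows when it matches the negative reference instead, which is the triplet-style behavior the formula describes. A real pipeline would compute these quantities on framework tensors taken from a pretrained VGG network rather than on Python lists.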

ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model

ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

Style Fader Generative Adversarial Networks for Style Degree Controllable Artistic Style Transfer

Preserving Structural Consistency in Arbitrary Artist and Artwork Style Transfer

StyleAdapter: A Unified Stylized Image Generation Model

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

AesUST: Towards Aesthetic-Enhanced Universal Style Transfer

Arbitrary Style Guidance for Enhanced Diffusion-Based Text-to-Image Generation

DiffStyler: Diffusion-based Localized Image Style Transfer

StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

Inversion-Based Style Transfer with Diffusion Models

ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation

DiffuseST: Unleashing the Capability of the Diffusion Model for Style Transfer

Intelligent Typography: Artistic Text Style Transfer for Complex Texture and Structure

StyleForge: Enhancing Text-to-Image Synthesis for Any Artistic Styles with Dual Binding

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models

DEADiff: an Efficient Stylization Diffusion Model with Disentangled Representations