ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model

Chengming Xu, Kai Hu, Qilin Wang, Donghao Luo, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Chengjie Wang
2024-11-18
Abstract: Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. In this paper, we present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion (SD) to address challenges such as misinterpreted styles and inconsistent semantics. Our approach introduces two innovative modules: the mixed style descriptor and the dynamic attention adapter. The mixed style descriptor enhances SD by combining content-aware and frequency-disentangled embeddings from CLIP with additional sources that capture global statistics and textual information, thus providing a richer blend of style-related and semantic-related knowledge. To achieve a better balance between adapter capacity and semantic control, the dynamic attention adapter is integrated into the diffusion UNet, dynamically calculating adaptation weights based on the style descriptors. Additionally, we introduce two objective functions to optimize the model alongside the denoising loss, further enhancing semantic and style consistency. Extensive experiments demonstrate the superiority of ArtWeaver over existing methods, producing images with diverse target styles while maintaining the semantic integrity of the text prompts.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets two main challenges in **Stylized Text-to-Image Generation (STIG)**:

1. **Misinterpreted styles**: Existing methods do not fully capture the complex artistic style of the reference image, so the generated images do not match the expected style.
2. **Inconsistent semantics**: In existing methods, elements of the reference image unduly influence the output, so the generated images are inconsistent with the content of the text prompt.

To address these problems, the authors propose a new framework, **ArtWeaver**. ArtWeaver improves how style embeddings are extracted and injected by introducing two modules:

- **Mixed Style Descriptor (MSD)**: This module combines content-aware and frequency-disentangled embeddings with global statistics and textual information, providing a richer representation of style and semantic knowledge.
- **Dynamic Attention Adapter (DAA)**: This module dynamically calculates adaptation weights from the style descriptor, giving a better balance between adapter capacity and semantic control in the diffusion UNet.

In addition, the authors introduce two new objective functions, the **Gram-consistency loss** and the **semantic-disentanglement loss**, to further improve the style consistency and semantic consistency of the model.

### Formula summary

1. **Gram-consistency loss**:

   \[ L_{\text{style}} = \max\{\delta_p - \delta_n + 0.1,\ 0\} \]

   where

   \[ \delta_p = \sum \left| G(\phi_{\text{vgg}}(\hat{x}_0)) - G(\phi_{\text{vgg}}(I_{\text{pos}})) \right| \]

   \[ \delta_n = \sum \left| G(\phi_{\text{vgg}}(\hat{x}_0)) - G(\phi_{\text{vgg}}(I_{\text{neg}})) \right| \]

2. **Semantic-disentanglement loss**:

   \[ L_{\text{disen}} = \text{sim}(z_{\text{cap}}, z_s) - \delta \cdot \text{sim}(z_{\text{CLIP}}, z_s) \]

   where $\text{sim}$ denotes cosine similarity and $\delta$ is a hyperparameter.

3. **Noise-prediction loss** (the standard diffusion-model loss):

   \[ L_{\text{noise}} = \mathbb{E}_{x,\ \epsilon \sim \mathcal{N}(0,1),\ t}\left[ \|\epsilon - \epsilon_\theta(z_t, t)\|_2^2 \right] \]

With these improvements, ArtWeaver can generate high-quality images in diverse target styles while maintaining the semantic integrity of the text prompts.
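
To make the three losses in the formula summary concrete, here is a minimal pure-Python sketch that evaluates each one on small hand-made inputs. It is an illustration under stated assumptions, not the authors' implementation. The VGG features $\phi_{\text{vgg}}(\cdot)$ are assumed to be already extracted and passed in as C x N nested lists (channels by spatial positions). The Gram matrix $G$ is assumed to be the channel inner product divided by N, a normalization the summary does not specify. The embeddings $z_{\text{cap}}$, $z_{\text{CLIP}}$ and $z_s$ are treated as plain vectors. All function names are hypothetical, and $\delta$ is left as a required argument because no value is given above.

```python
import math


def gram(features):
    """Gram matrix of a C x N feature map (C channels, N positions).

    Dividing by N is an assumed normalization, not taken from the summary.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def gram_distance(feat_a, feat_b):
    """Sum of absolute element-wise differences between two Gram matrices."""
    ga, gb = gram(feat_a), gram(feat_b)
    return sum(abs(x - y) for ra, rb in zip(ga, gb) for x, y in zip(ra, rb))


def gram_consistency_loss(feat_gen, feat_pos, feat_neg, margin=0.1):
    """L_style = max(delta_p - delta_n + margin, 0), with margin 0.1 as above."""
    delta_p = gram_distance(feat_gen, feat_pos)  # distance to the positive style
    delta_n = gram_distance(feat_gen, feat_neg)  # distance to the negative style
    return max(delta_p - delta_n + margin, 0.0)


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_disentanglement_loss(z_cap, z_clip, z_s, delta):
    """L_disen = sim(z_cap, z_s) - delta * sim(z_CLIP, z_s)."""
    return cosine(z_cap, z_s) - delta * cosine(z_clip, z_s)


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 error for one sample; the expectation in L_noise
    corresponds to averaging this over samples, noise draws and timesteps."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))


if __name__ == "__main__":
    gen = [[1.0, 0.0], [0.0, 1.0]]
    pos = [[1.0, 0.0], [0.0, 1.0]]   # same Gram matrix as gen
    neg = [[2.0, 0.0], [0.0, 0.0]]   # different Gram matrix
    print(gram_consistency_loss(gen, pos, neg))
    print(semantic_disentanglement_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], delta=0.5))
    print(noise_prediction_loss([1.0, 2.0], [1.0, 0.0]))
```

Read this way, the Gram-consistency loss is a hinge: it is zero once the generated image's Gram statistics are closer to the positive reference than to the negative one by at least the margin. The semantic-disentanglement loss, as written, decreases when $\text{sim}(z_{\text{cap}}, z_s)$ falls and $\text{sim}(z_{\text{CLIP}}, z_s)$ rises.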