Abstract: Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advancements in text-to-image models have improved the nuance of style transformations, yet significant challenges remain, particularly with overfitting to reference styles, limiting stylistic control, and misaligning with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism for better integration of style and text features, enhancing alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during early generation stages to stabilize spatial layouts and mitigate artifacts. Our extensive evaluations demonstrate significant improvements in style transfer quality and alignment with textual prompts. Furthermore, our approach can be integrated into existing style transfer frameworks without fine-tuning.
### What problems does this paper attempt to solve?
This paper addresses several key challenges in text-driven style transfer:
1. **Style overfitting**:
- Existing style transfer models often copy the details of the reference style image too closely. This limits the aesthetic flexibility of the generated image and reduces its adaptability to different style or content requirements. For example, the generated image may rely too heavily on the color, texture, and other features of the reference style image while ignoring specific descriptions in the text prompt.
2. **Text alignment accuracy**:
- During text-to-image generation, existing models often prioritize the dominant colors or patterns of the reference style image, even when these features contradict the text prompt. This rigidity weakens the model's ability to interpret and incorporate subtle textual guidance, making the output less precise and less customizable.
3. **Layout instability and artifacts**:
- Style transfer can introduce unwanted artifacts that undermine the stability of the underlying text-to-image model. For example, the common "checkerboard effect" appears as a repeating pattern in the generated image regardless of the user's instructions.
To solve these problems, the paper proposes three complementary strategies:
1. **Cross-modal Adaptive Instance Normalization (AdaIN)**:
- An AdaIN mechanism fuses the style image features with the text features, so that the two guide generation jointly and the style features stay consistent with the text instructions.
2. **Style-based Classifier-Free Guidance (SCFG)**:
- A new style guidance method allows selective control of style elements and reduces irrelevant influences. Specifically, a "negative sample" image is generated with a layout-controlled generation model (such as ControlNet). This image retains the overall content of the reference image but excludes the target style elements, which helps the model focus on transferring the specific style components.
3. **Teacher model for layout stability**:
- A "teacher model" is introduced in the early stage of generation to share its spatial attention map, keeping the spatial layout stable and alleviating problems such as the checkerboard effect. The teacher model is the original text-to-image model; it performs denoising under the same text prompt and shares its spatial attention map with the style model at each time step.
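The AdaIN operation named in the first strategy can be sketched in a few lines. This is a minimal illustration of standard AdaIN (re-normalizing one set of features to the per-channel mean and standard deviation of another), not the paper's implementation. The function name `adain`, the list-of-channels layout, and the choice of which modality supplies the statistics are assumptions; the summary above does not specify them.

```python
from statistics import fmean, pstdev

def adain(source, target, eps=1e-5):
    """Standard AdaIN: shift and scale each channel of `source` so that it
    takes on the mean and standard deviation of the matching `target` channel.

    Assumption: features are given as a list of channels, each a list of floats.
    Which of the style/text features is `source` and which is `target` is a
    choice made for this sketch only.
    """
    out = []
    for src_ch, tgt_ch in zip(source, target):
        mu_s, sd_s = fmean(src_ch), pstdev(src_ch)
        mu_t, sd_t = fmean(tgt_ch), pstdev(tgt_ch)
        # Normalize the source channel, then apply the target statistics.
        out.append([(v - mu_s) / (sd_s + eps) * sd_t + mu_t for v in src_ch])
    return out

# Toy example: one channel of hypothetical "style" features re-normalized
# to the statistics of one channel of hypothetical "text" features.
style_feats = [[1.0, 2.0, 3.0]]
text_feats = [[10.0, 20.0, 30.0]]
fused = adain(style_feats, text_feats)
```

After the call, each channel of `fused` keeps the relative structure of the source channel while its mean and spread match the target channel, which is the sense in which the two feature sets are brought into a common range.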
According to the paper, these strategies significantly improve style transfer quality and consistency with text prompts, and they can be integrated into existing style transfer frameworks without fine-tuning.
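The guidance idea behind SCFG can be illustrated with the familiar classifier-free guidance extrapolation. The sketch below assumes that the noise prediction conditioned on the negative sample image plays the role that the unconditional prediction plays in standard classifier-free guidance; the paper's exact formulation may differ. The function name `guided_noise`, its parameters, and the flat-list representation of a noise prediction are hypothetical.

```python
def guided_noise(eps_style, eps_negative, scale):
    """Extrapolate from the negative-image prediction toward the
    style-image prediction, as in classifier-free guidance.

    Assumption: both predictions are flat lists of floats of equal length,
    and the negative-image branch replaces the usual unconditional branch.
    """
    return [n + scale * (s - n) for s, n in zip(eps_style, eps_negative)]

# Toy example: the two predictions differ only in the first component,
# so only that component is amplified by the guidance scale.
eps = guided_noise(eps_style=[1.0, 2.0], eps_negative=[0.0, 2.0], scale=3.0)
```

Components on which the two predictions agree (content shared by the reference and the negative image) pass through unchanged, while components on which they differ (the target style) are amplified. With `scale=1.0` the function returns the style-conditioned prediction unchanged.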
### Summary
The main contributions of the paper include:
- Proposing a cross-modal AdaIN mechanism to better integrate style and text features.
- Introducing Style-based Classifier-Free Guidance (SCFG) to achieve selective control of style elements.
- Using a teacher model to stabilize the spatial layout and reduce artifacts.
- Demonstrating, through extensive evaluation, improved performance across multiple styles and prompts.
Together, these improvements keep the generated image faithful to the text prompt, avoid unnecessary style overfitting, and maintain a stable layout.
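The teacher-model schedule described above (sharing the teacher's spatial attention map during the early stage of generation) can be sketched as a simple per-step selection. This is a toy illustration only: the function name `pick_attention`, the `share_steps` cut-off, and the string placeholders standing in for attention maps are assumptions, and a real implementation would swap attention maps inside the denoising network.

```python
def pick_attention(step, teacher_attn, student_attn, share_steps):
    """Return the teacher's attention map for the first `share_steps`
    denoising steps, and the style model's own map afterwards.

    Assumption: `share_steps` is a hypothetical cut-off marking the end
    of the "early stage"; the summary does not give a value.
    """
    return teacher_attn if step < share_steps else student_attn

# Toy loop: 5 denoising steps, teacher maps shared for the first 2.
used = [pick_attention(t, "teacher", "student", share_steps=2) for t in range(5)]
```

The design intuition is that early denoising steps largely determine the spatial layout, so borrowing the teacher's attention there stabilizes composition, while later steps are left to the style model.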