Abstract: Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity to imbue the diffusion model with creativity and find our new methods improve upon the traditional method, based on automated metrics for human judgment, while still maintaining creativity and novelty.

What problem does this paper attempt to address?

This paper addresses how to improve the style ambiguity loss for image-generation models without relying on pre-trained classifiers or labeled datasets, in order to enhance the creativity and novelty of the models. Traditional style ambiguity loss methods require a pre-trained classifier to identify image styles, which increases the training cost and also requires labeled datasets. To address this, the paper proposes a new style ambiguity loss that can be applied to diffusion models and requires neither additional classifier training nor labeled datasets. The aim is to simplify model training and improve the quality of generated images while maintaining creativity and novelty.

### Main contributions:

1. **Applying creative style ambiguity loss to diffusion models**: Diffusion models are easier to train than GANs and can generate higher-quality images.
2. **Developing creative style ambiguity loss based on CLIP and K-Means**: These methods do not require training a separate GAN-style classifier.
3. **Experimental evidence that the new method is superior to traditional methods**: According to automated evaluation metrics, the samples generated by the new method are of higher quality while maintaining creativity and novelty.

### Method overview:

- **Diffusion models**: Diffusion models generate images by gradually removing noise, avoiding the mode collapse and unstable training seen in GANs.
- **Style ambiguity loss**: The paper proposes two new style ambiguity loss methods:
  - **CLIP-based classifier**: Uses the pre-trained CLIP model to calculate the similarity between images and texts, and then obtains the classification result through softmax normalization.
  - **K-Means-based classifier**: Clusters image or text labels to generate cluster centers, calculates the distance between the generated image and these centers, and then obtains the classification result through softmax normalization.

### Experimental results:

- **Quantitative evaluation**: Evaluation uses three indicators: AVA score, image reward, and prompt similarity. The results show that the K-Means-based method is superior to the traditional DCGAN method in terms of human preference.
- **Comparison with the baseline model**: The images generated by the model after DDPO training are significantly different in style from those of the baseline model, indicating that the model has learned new artistic styles.

Through these improvements, the paper enhances the creativity and novelty of the image-generation model while reducing the dependence on pre-trained classifiers and labeled datasets.
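To make the two classifier variants in the method overview concrete, here is a minimal Python sketch, not the paper's implementation. It assumes that embeddings are plain lists of floats (in practice they would come from a model such as CLIP), that the K-Means variant turns distances into scores by negating them before the softmax, and that the style ambiguity loss is the cross-entropy between the predicted style distribution and a uniform distribution. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_style_probs(image_emb, style_text_embs):
    """CLIP-style variant: softmax over image-to-style-text similarities."""
    return softmax([cosine_similarity(image_emb, t) for t in style_text_embs])


def distance_style_probs(image_emb, centers):
    """K-Means-style variant: softmax over negated distances to the centers.

    Negating the distance is an assumption made here so that closer
    centers receive higher probability.
    """
    return softmax([-math.dist(image_emb, c) for c in centers])


def style_ambiguity_loss(probs, eps=1e-12):
    """Cross-entropy between a uniform target and the predicted distribution.

    The loss is smallest (log k for k styles) when the prediction is uniform,
    i.e. when the image cannot be assigned to any single style.
    """
    k = len(probs)
    return -sum((1.0 / k) * math.log(p + eps) for p in probs)


# Toy example with two hypothetical cluster centers in 2-D.
centers = [[0.0, 0.0], [2.0, 0.0]]
ambiguous = distance_style_probs([1.0, 0.0], centers)   # equidistant
confident = distance_style_probs([0.0, 0.0], centers)   # sits on a center
print(style_ambiguity_loss(ambiguous), style_ambiguity_loss(confident))
```

An image equidistant from all centers gets a uniform style distribution and therefore the lowest loss; an image that sits on one center is confidently classified and is penalized. In a reinforcement-learning setup, the negated loss could serve as the reward to maximize.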

Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss

Using Style Ambiguity Loss to Improve Aesthetics of Diffusion Models

UATST: Towards Unpaired Arbitrary Text-Guided Style Transfer with Cross-Space Modulation

Improving Diffusion Models for Scene Text Editing with Dual Encoders

CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion

ArtFusion: Controllable Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models

ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

Fine-grained Text Style Transfer with Diffusion-Based Language Models

3Dstyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models

StyleDrop: Text-to-Image Generation in Any Style

Customizing Text-to-Image Models with a Single Image Pair

StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models

Multi-Concept Customization of Text-to-Image Diffusion

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements

Multimodal Data Augmentation for Image Captioning using Diffusion Models

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models