Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss

James Baker
2024-06-20
Abstract: Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity to imbue the diffusion model with creativity and find our new methods improve upon the traditional method, based on automated metrics for human judgment, while still maintaining creativity and novelty.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to improve the style ambiguity loss for image-generation models without relying on pre-trained classifiers or labeled datasets, in order to enhance the creativity and novelty of the models. Specifically, traditional style ambiguity loss methods require a pre-trained classifier to identify image styles, which increases the training cost and also requires a labeled dataset. To address this, the author proposes new style ambiguity loss methods that can be applied to diffusion models and that require neither additional classifier training nor a labeled dataset. In this way, the author hopes to simplify the model training process and improve the quality of generated images while maintaining creativity and novelty.

### Main contributions:

1. **Applying creative style ambiguity loss to diffusion models**: Diffusion models are easier to train than GANs and can generate higher-quality images.
2. **Developing creative style ambiguity loss based on CLIP and K-Means**: These methods do not require training a separate GAN-style classifier.
3. **Experimental evidence that the new method outperforms traditional methods**: According to automated evaluation metrics, the samples generated by the new method are of higher quality while maintaining creativity and novelty.

### Method overview:

- **Diffusion models**: Diffusion models generate images by gradually removing noise, avoiding the mode collapse and unstable training that affect GANs.
- **Style ambiguity loss**: The author proposes two new style ambiguity loss methods:
  - **CLIP-based classifier**: Uses the pre-trained CLIP model to calculate the similarity between images and texts, then obtains the classification result through softmax normalization.
  - **K-Means-based classifier**: Clusters image or text labels to generate cluster centers, calculates the distance between the generated image and these centers, then obtains the classification result through softmax normalization.

### Experimental results:

- **Quantitative evaluation**: Evaluation uses three metrics: AVA score, image reward, and prompt similarity. The results show that the K-Means-based method is superior to the traditional DCGAN method in terms of human preference.
- **Comparison with the baseline model**: The images generated by the model after DDPO training differ significantly in style from those of the baseline model, indicating that the model has learned new artistic styles.

Through these improvements, the author enhances the creativity and novelty of the image-generation model while reducing the dependence on pre-trained classifiers and labeled datasets.
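The two classifier-free scoring schemes from the method overview can be illustrated with a minimal NumPy sketch. It is not the author's implementation. Several pieces are assumptions for illustration: the function names are hypothetical, the embeddings are random stand-ins for real CLIP features, distances are turned into scores by negation before the softmax, and ambiguity is measured as the cross-entropy between the uniform distribution and the predicted style distribution, which is the usual formulation of style ambiguity loss.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def similarity_style_probs(image_emb, text_embs):
    """CLIP-style classifier: softmax over cosine similarities between
    one image embedding and one text embedding per style."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return softmax(text_embs @ image_emb)


def kmeans_style_probs(image_emb, centers):
    """K-Means-style classifier: softmax over negated Euclidean distances
    to the cluster centers (assumption: nearer centers score higher)."""
    dists = np.linalg.norm(centers - image_emb, axis=1)
    return softmax(-dists)


def style_ambiguity_loss(probs, eps=1e-12):
    """Cross-entropy between the uniform distribution and `probs`.
    It is smallest, log(k), when the style prediction is uniform."""
    k = probs.shape[-1]
    return float(-np.sum(np.log(probs + eps)) / k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=8)         # stand-in for an image embedding
    style_refs = rng.normal(size=(4, 8))   # stand-in text embeddings or centers
    for name, fn in [("similarity", similarity_style_probs),
                     ("k-means", kmeans_style_probs)]:
        probs = fn(image_emb, style_refs)
        print(name, probs.round(3), style_ambiguity_loss(probs))
```

In a reward-based fine-tuning setup, the negative of this loss could serve as the reward, so that images the classifier cannot confidently assign to any one style score highest.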