Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation

Brinnae Bent
2024-04-13
Abstract: In this study, we identify the need for an interpretable, quantitative score of the repeatability, or consistency, of image generation in diffusion models. We propose a semantic approach, using a pairwise mean CLIP (Contrastive Language-Image Pretraining) score as our semantic consistency score. We applied this metric to compare two state-of-the-art open-source image generation diffusion models, Stable Diffusion XL and PixArt-α, and we found statistically significant differences between the semantic consistency scores for the models. Agreement between the Semantic Consistency Score selected model and aggregated human annotations was 94%. We also explored the consistency of SDXL and a LoRA-fine-tuned version of SDXL and found that the fine-tuned model had significantly higher semantic consistency in generated images. The Semantic Consistency Score proposed here offers a measure of image generation alignment, facilitating the evaluation of model architectures for specific tasks and aiding in informed decision-making regarding model selection.
Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of quantifying the consistency, or repeatability, of the images produced by diffusion models. The authors propose a semantic approach: the pairwise mean of CLIP (Contrastive Language-Image Pretraining) scores serves as a Semantic Consistency Score. The score is computed from CLIP image embeddings of the generated images and ranges from 0 to 100, with higher values indicating greater semantic consistency.
Using this metric, the paper compares two state-of-the-art open-source image generation diffusion models, Stable Diffusion XL (SDXL) and PixArt-α, and finds statistically significant differences between their semantic consistency scores. Across 100 different prompts, PixArt-α had a higher average consistency score than SDXL, suggesting that PixArt-α generates more semantically consistent images across diverse prompts. The study also compares SDXL with a LoRA-fine-tuned version of SDXL and finds that the fine-tuned model has significantly higher semantic consistency in its generated images.
The paper's contribution is an interpretable, quantitative score of image generation consistency. Such a score helps evaluate model architectures for specific tasks and informs the trade-off between creativity and consistency during model selection. According to the summary, quantifying consistency also makes it possible to assess model stability, detect potential biases, verify the interpretability of model outputs, and improve user understanding.
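The pairwise mean score described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the image embeddings have already been extracted with a CLIP image encoder, that each pairwise score is the cosine similarity scaled by 100, and that negative similarities are clamped to 0 so the result stays in the 0 to 100 range. The function names `cosine_similarity` and `semantic_consistency_score` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def semantic_consistency_score(embeddings):
    """Mean over all unordered image pairs of 100 * max(cos, 0).

    `embeddings` is a list of precomputed image embeddings (lists of
    floats). The scaling and the clamp at 0 are assumptions made for
    this sketch so that the score lies between 0 and 100.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to form a pair")
    pair_scores = [
        100.0 * max(cosine_similarity(a, b), 0.0)
        for a, b in combinations(embeddings, 2)
    ]
    return sum(pair_scores) / len(pair_scores)
```

With this sketch, a set of identical embeddings scores 100 and a set of mutually orthogonal embeddings scores 0. In practice the embeddings would come from a CLIP image encoder applied to the images a model generates.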
Additionally, agreement between aggregated human annotations and the model selected by the Semantic Consistency Score was 94%, which supports the validity of the score. In the comparison between SDXL and its LoRA-fine-tuned version, the fine-tuned model had significantly higher semantic consistency, suggesting that fine-tuning can improve the semantic consistency of a model's outputs. In summary, the Semantic Consistency Score offers a way to evaluate image generation models, helps select suitable models for applications that require high consistency, and provides a quantitative means of assessing the effects of fine-tuning.
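The agreement figure mentioned above can be illustrated with a small sketch. It assumes that human annotations are aggregated per prompt by majority vote, that the score-selected model for a prompt is the one with the higher consistency score, and that prompts with tied or missing human votes are skipped. These aggregation rules, the function names, and the data layout are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label, or None on a tie or empty input."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def agreement_rate(scores_by_prompt, votes_by_prompt):
    """Percentage of prompts where the higher-scoring model matches
    the human majority choice.

    scores_by_prompt: {prompt: {model_name: consistency_score}}
    votes_by_prompt:  {prompt: [model_name chosen by each annotator]}
    """
    matches = 0
    total = 0
    for prompt, scores in scores_by_prompt.items():
        human_choice = majority_vote(votes_by_prompt.get(prompt, []))
        if human_choice is None:
            continue  # skip prompts without a clear human preference
        score_choice = max(scores, key=scores.get)
        total += 1
        if score_choice == human_choice:
            matches += 1
    if total == 0:
        raise ValueError("no prompts with a clear human preference")
    return 100.0 * matches / total
```

A tie-handling rule matters in a setup like this: skipping tied prompts, as done here, changes the denominator, so a reported agreement percentage is only comparable across studies that aggregate annotations in the same way.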