Abstract: In this study, we identify the need for an interpretable, quantitative score of the repeatability, or consistency, of image generation in diffusion models. We propose a semantic approach, using a pairwise mean CLIP (Contrastive Language-Image Pretraining) score as our semantic consistency score. We applied this metric to compare two state-of-the-art open-source image generation diffusion models, Stable Diffusion XL and PixArt-α, and we found statistically significant differences between the semantic consistency scores for the models. Agreement between the Semantic Consistency Score selected model and aggregated human annotations was 94%. We also explored the consistency of SDXL and a LoRA-fine-tuned version of SDXL and found that the fine-tuned model had significantly higher semantic consistency in generated images. The Semantic Consistency Score proposed here offers a measure of image generation alignment, facilitating the evaluation of model architectures for specific tasks and aiding in informed decision-making regarding model selection.

What problem does this paper attempt to address?

The paper addresses the problem of quantifying the consistency, or repeatability, of diffusion model outputs in image generation. The researchers propose a semantic approach: the pairwise mean of CLIP (Contrastive Language-Image Pretraining) scores across the images generated for a prompt serves as a Semantic Consistency Score. The score is computed from CLIP image embeddings and ranges from 0 to 100, with higher values indicating greater semantic consistency among the generated images.

Using this metric, the paper compares two state-of-the-art open-source image generation diffusion models, Stable Diffusion XL (SDXL) and PixArt-α, and finds statistically significant differences between their semantic consistency scores. Across 100 different prompts, PixArt-α had a higher average consistency score than SDXL, indicating that its outputs are more consistent across diverse prompts. The model selected by the Semantic Consistency Score agreed with aggregated human annotations in 94% of cases, which supports the validity of the score. The study also compares SDXL with a LoRA-fine-tuned version of SDXL and finds that the fine-tuned model has significantly higher semantic consistency in its generated images, suggesting that fine-tuning can improve consistency.

The contribution of the paper is an interpretable, quantitative consistency score for image generation. Such a score helps evaluate model architectures for specific tasks and informs the trade-off between creativity and consistency during model selection. According to the summary, quantifying consistency also makes it possible to assess model stability, detect potential biases, check the interpretability of model outputs, and improve user understanding. It further offers a quantitative way to assess the effect of model fine-tuning and to select models for applications that require high consistency.
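The pairwise mean score described above can be illustrated with a minimal sketch. It assumes the CLIP image embeddings have already been computed elsewhere and are passed in as plain lists of floats, and it assumes the common CLIP score convention of 100 times the cosine similarity, clipped at zero; the paper's exact scaling is not given in this summary. The function names and the `"A"`/`"B"` labels are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_consistency_score(embeddings):
    """Mean pairwise CLIP-style score over all images generated for one prompt.

    Assumption: each pair is scored as 100 * max(cos, 0), so the result
    lies between 0 and 100, matching the range stated in the summary.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two image embeddings")
    pair_scores = [
        100.0 * max(cosine_similarity(u, v), 0.0)
        for u, v in combinations(embeddings, 2)
    ]
    return sum(pair_scores) / len(pair_scores)


def agreement_rate(scores_a, scores_b, human_choices):
    """Fraction of prompts where the higher-scoring model matches the human label.

    Assumption: human_choices holds "A" or "B" per prompt (a hypothetical
    encoding of the aggregated annotation); ties go to model "A".
    """
    matches = 0
    for a, b, choice in zip(scores_a, scores_b, human_choices):
        selected = "A" if a >= b else "B"
        matches += selected == choice
    return matches / len(human_choices)
```

In practice the embeddings would come from a CLIP image encoder applied to several images generated from the same prompt with different seeds. Averaging the per-prompt scores over a prompt set gives one number per model, which is the kind of quantity compared between SDXL and PixArt-α above.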

Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation
