Abstract: In the field of machine content generation, the state-of-the-art text-to-image model DALL⋅E has advanced and diverse capabilities for generating combinational images from specific textual prompts. The images generated by DALL⋅E appear to exhibit an appreciable level of combinational creativity, close to that of humans, in visualizing a combinational idea. Several common metrics can be applied to assess the quality of images produced by generative models, such as IS, FID, GIQA, and CLIP, but it is unclear whether these metrics are equally applicable to images containing combinational creativity. In this study, we collected image data from a machine (DALL⋅E) and from human designers. The group rankings obtained with the Consensual Assessment Technique (CAT) and the Turing Test (TT) were used as benchmarks for assessing combinational creativity. Because the metrics rest on different mathematical principles and start from different notions of image quality, we introduced the coincident rate (CR) and the average rank variation (ARV) as two comparable measures. We then conducted an experiment that calculated how consistent each metric's group ranking was with the benchmarks. By comparing the CR and ARV consistency results, we summarized the applicability of the existing evaluation metrics to generative images containing combinational creativity. Of the four metrics, GIQA was the most consistent with the CAT and TT. It shows potential as an automated assessment for images containing combinational creativity, and could be used to evaluate such images in design and engineering tasks such as conceptual sketches, digital design images, and prototyping images.
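The abstract names CR and ARV but does not define them. The sketch below is one plausible reading, offered only as an illustration and not as the study's actual formulas: CR is assumed to be the fraction of rank positions at which a metric's group ranking coincides with the benchmark ranking, and ARV is assumed to be the mean absolute shift in rank position between the two rankings. The function names and the example group labels are hypothetical.

```python
def coincident_rate(metric_ranking, benchmark_ranking):
    """Assumed definition: share of positions where both rankings agree."""
    if len(metric_ranking) != len(benchmark_ranking):
        raise ValueError("rankings must have the same length")
    matches = sum(m == b for m, b in zip(metric_ranking, benchmark_ranking))
    return matches / len(benchmark_ranking)


def average_rank_variation(metric_ranking, benchmark_ranking):
    """Assumed definition: mean absolute difference in rank position."""
    if sorted(metric_ranking) != sorted(benchmark_ranking):
        raise ValueError("rankings must contain the same groups")
    # Position of each group in the benchmark ranking (0 = top rank).
    benchmark_pos = {group: i for i, group in enumerate(benchmark_ranking)}
    shifts = [abs(i - benchmark_pos[group])
              for i, group in enumerate(metric_ranking)]
    return sum(shifts) / len(shifts)


# Hypothetical group labels, ordered from best to worst.
benchmark = ["A", "B", "C", "D"]   # e.g. a CAT or TT group ranking
metric = ["A", "C", "B", "D"]      # e.g. a ranking induced by one metric

cr = coincident_rate(metric, benchmark)
arv = average_rank_variation(metric, benchmark)
print(f"CR = {cr:.2f}, ARV = {arv:.2f}")
```

Under these assumed definitions, a higher CR and a lower ARV both indicate that a metric's ranking is closer to the benchmark ranking.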
