Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2

Ali Borji
2023-06-06
Abstract: The field of image synthesis has made great strides in the last couple of years. Recent models are capable of generating images with astonishing quality. Fine-grained evaluation of these models on some interesting categories such as faces is still missing. Here, we conduct a quantitative comparison of three popular systems including Stable Diffusion, Midjourney, and DALL-E 2 in their ability to generate photorealistic faces in the wild. We find that Stable Diffusion generates better faces than the other systems, according to the FID score. We also introduce a dataset of generated faces in the wild dubbed GFW, including a total of 15,076 faces. Furthermore, we hope that our study spurs follow-up research in assessing the generative models and improving them. Data and code are available at data and code, respectively.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper compares three popular image generation models (Stable Diffusion, Midjourney, and DALL-E 2) on their ability to synthesize realistic human faces. Specifically, it evaluates faces that appear in complex scenes rather than in images optimized for portraits.

The authors conducted their experiments in the following steps:

1. **Model Selection**: Three models, namely Stable Diffusion, Midjourney, and DALL-E 2, were selected for comparison.
2. **Dataset Construction**: To build a dataset of generated faces, the authors used captions from the COCO dataset as prompts to generate images and then detected faces in them. They also collected real-world face data, including faces from the COCO training set and the Labeled Faces in the Wild (LFW) dataset.
3. **Quality Evaluation**: The Fréchet Inception Distance (FID) score, for which lower values indicate closer distributions, was used to measure the similarity between the generated faces and real faces.

The study found that Stable Diffusion performed best: according to the FID score, its faces were more realistic than those of the other two models. Even so, a significant gap remains between generated and real faces, indicating substantial room for improvement.

Future research directions may include:

- Increasing the number of face samples generated by DALL-E 2 for a more comprehensive comparison.
- Investigating whether the generation systems exhibit data memorization.
- Exploring whether the generated faces exhibit social bias.
- Using metrics more suitable for face evaluation (such as SSIM, LPIPS, etc.).
- Conducting a more detailed analysis of facial attributes (such as expression, age, viewpoint, etc.).
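To make the FID-based quality evaluation more concrete, here is a minimal sketch of the Fréchet distance between two sets of feature vectors. It is not the author's code. In an actual FID computation the features would be activations of an Inception network on real and generated face crops; here random arrays stand in for them, and the function name `frechet_distance` is a hypothetical choice for illustration.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. For positive semi-definite covariances
    # these are real and non-negative, up to numerical noise, so we drop
    # tiny imaginary parts and clip small negatives to zero.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder features (assumption): stand-ins for Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))
fake = rng.normal(0.5, 1.0, size=(500, 8))

print(frechet_distance(real, real))  # essentially zero: identical sets
print(frechet_distance(real, fake))  # larger: the means are shifted
```

A lower value means the two feature distributions are closer, which is the sense in which a lower FID indicates more realistic generated faces.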