Comparative Analysis of Generative Models: Enhancing Image Synthesis with VAEs, GANs, and Stable Diffusion

Sanchayan Vivekananthan
2024-08-16
Abstract: This paper examines three major generative modelling frameworks: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Stable Diffusion models. VAEs are effective at learning latent representations but frequently yield blurry results. GANs can generate realistic images but face issues such as mode collapse. Stable Diffusion models produce high-quality images with strong semantic coherence but are computationally demanding. Additionally, the paper explores how incorporating Grounding DINO and Grounded SAM with Stable Diffusion improves image accuracy by utilising sophisticated segmentation and inpainting techniques. The analysis offers guidance on selecting suitable models for various applications and highlights areas for further research.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
The paper aims to address various issues in generative models for image synthesis and explores how to improve the performance of these models by combining different techniques. Specifically, the paper compares three main generative modeling frameworks: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Stable Diffusion models. Each model has its advantages and limitations:

1. **Variational Autoencoders (VAEs)**:
   - Advantages: Capable of effectively learning latent representations, suitable for learning complex probability distributions.
   - Limitations: Generated images have blurry edges, and there is a posterior collapse issue.
2. **Generative Adversarial Networks (GANs)**:
   - Advantages: Can generate high-quality, realistic images.
   - Limitations: Training is unstable, prone to mode collapse, and requires high computational resources.
3. **Stable Diffusion models**:
   - Advantages: Generate high-resolution, detail-rich images while maintaining semantic consistency.
   - Limitations: The inference process is time-consuming and requires high computational resources.

Additionally, the paper explores methods to combine Grounding DINO and Grounded SAM with Stable Diffusion to improve the accuracy of image segmentation and object detection, thereby enhancing the effectiveness of image synthesis. While this approach improves image quality and consistency, it also increases computational complexity and the risk of overfitting.

Through these analyses, the paper aims to guide researchers and practitioners in selecting the most suitable generative model architecture for their specific needs and points out directions for future research.
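
To make the detection, segmentation, and inpainting combination more concrete, the following is a minimal sketch of only the final compositing step, not the paper's implementation. It assumes a detector has already returned a bounding box and a diffusion model has already produced a candidate image; both are replaced by hard-coded stand-ins. The names `box_to_mask` and `composite` are hypothetical, and a rectangular mask stands in for the finer mask a segmentation model would supply.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities
Mask = List[List[int]]     # 1 = take the generated pixel, 0 = keep the original


def box_to_mask(height: int, width: int, box: Tuple[int, int, int, int]) -> Mask:
    """Rasterise a box (x0, y0, x1, y1), end-exclusive, into a binary mask."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]


def composite(original: Image, generated: Image, mask: Mask) -> Image:
    """Keep original pixels where the mask is 0, generated pixels where it is 1."""
    return [[g if m else o for o, g, m in zip(orow, grow, mrow)]
            for orow, grow, mrow in zip(original, generated, mask)]


# Stand-ins (assumptions): a blank 3x4 image, a uniformly bright "generated"
# image in place of a diffusion model's output, and a fixed box in place of
# a detector's prediction.
original = [[0.0] * 4 for _ in range(3)]
generated = [[1.0] * 4 for _ in range(3)]
mask = box_to_mask(3, 4, (1, 0, 3, 2))
result = composite(original, generated, mask)
```

Only the masked region changes, which is why a more accurate mask from the detection and segmentation stage would be expected to translate into more accurate edits.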