Abstract: Text-to-image synthesis aims to generate a photo-realistic and semantically consistent image from a specific text description. Images synthesized by off-the-shelf models usually contain fewer components than the corresponding image and text description, which lowers both image quality and textual-visual consistency. To address this issue, we propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*, which introduces a dual vision-language matching mechanism to improve image quality and semantic consistency. The dual vision-language matching mechanism considers textual-visual matching between the generated image and the corresponding text description, and visual-visual consistency constraints between the synthesized image and the real image. Given a specific text description, VLMGAN* first encodes it into textual features and then feeds them to a dual vision-language matching-based generative model to synthesize a photo-realistic image that is semantically consistent with the text. In addition, the popular evaluation metrics for text-to-image synthesis are borrowed from plain image generation and mainly evaluate the realism and diversity of the synthesized images. Therefore, we introduce a metric named Vision-Language Matching Score (VLMS) to evaluate the performance of text-to-image synthesis, which considers both the image quality and the semantic consistency between the synthesized image and the description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods. We implement this strategy on two popular baselines, denoted ${\text{VLMGAN}_{+\text{AttnGAN}}}$ and ${\text{VLMGAN}_{+\text{DFGAN}}}$. The experimental results on two widely-used datasets show that the model achieves significant improvements over other state-of-the-art methods.
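The dual matching idea described in the abstract can be illustrated with a minimal sketch. This is not the formulation used by VLMGAN*, which the abstract does not specify. It assumes, purely for illustration, that the generated image, the real image, and the text have already been encoded into vectors of equal length, and that both matching terms are measured by cosine similarity. The names `cosine`, `dual_matching_loss`, and `weight` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_matching_loss(fake_img, real_img, text, weight=1.0):
    """Hypothetical dual matching loss (lower means better matching).

    Assumption: all three inputs are embeddings in a shared space.
    - textual-visual term: 1 - cos(generated image, text)
    - visual-visual term:  1 - cos(generated image, real image)
    """
    text_term = 1.0 - cosine(fake_img, text)
    visual_term = 1.0 - cosine(fake_img, real_img)
    return text_term + weight * visual_term


if __name__ == "__main__":
    # Toy embeddings: the generated image matches the text exactly
    # but is orthogonal to the real image.
    print(dual_matching_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

In this toy setting the loss is zero only when the generated-image embedding points in the same direction as both the text embedding and the real-image embedding, which mirrors the two constraints named in the abstract.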

DMF-GAN: Deep Multimodal Fusion Generative Adversarial Networks for Text-to-Image Synthesis

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis

Diversified text-to-image generation via deep mutual information estimation

SAW-GAN: Multi-granularity Text Fusion Generative Adversarial Networks for text-to-image generation

DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis

SIMGAN: Photo-Realistic Semantic Image Manipulation Using Generative Adversarial Networks

Multi-Semantic Fusion Generative Adversarial Network for Text-to-Image Generation

CF-GAN: cross-domain feature fusion generative adversarial network for text-to-image synthesis

DualG-GAN, a Dual-channel Generator based Generative Adversarial Network for text-to-face synthesis

SAM-GAN: Self-Attention supporting Multi-stage Generative Adversarial Networks for text-to-image synthesis

TF-GAN: Text Feature Fusion GAN for Text-to-Image Generation

Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for Text-to-Image Generation

ISF-GAN: Imagine, Select, and Fuse with GPT-Based Text Enrichment for Text-to-Image Synthesis

DGattGAN: Cooperative Up-Sampling Based Dual Generator Attentional GAN on Text-to-Image Synthesis

DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis

CMAFGAN: A Cross-Modal Attention Fusion based Generative Adversarial Network for attribute word-to-face synthesis

KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis

Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks