Abstract: Text-to-image synthesis aims to generate a photo-realistic and semantically consistent image from a specific text description. The images synthesized by off-the-shelf models usually contain fewer components than the corresponding real image and text description, which lowers both the image quality and the textual-visual consistency. To address this issue, we propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*, which introduces a dual vision-language matching mechanism to strengthen image quality and semantic consistency. The dual vision-language matching mechanism considers textual-visual matching between the generated image and the corresponding text description, and visual-visual consistency constraints between the synthesized image and the real image. Given a specific text description, VLMGAN* first encodes it into textual features and then feeds them to a dual vision-language matching-based generative model to synthesize a photo-realistic and semantically consistent image. In addition, the popular evaluation metrics for text-to-image synthesis are borrowed from plain image generation and mainly evaluate the realism and diversity of the synthesized images. We therefore introduce a metric named Vision-Language Matching Score (VLMS) to evaluate text-to-image synthesis, which considers both the image quality and the semantic consistency between the synthesized image and the description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods. We implement this strategy on two popular baselines, denoted ${\text{VLMGAN}_{+\text{AttnGAN}}}$ and ${\text{VLMGAN}_{+\text{DFGAN}}}$. The experimental results on two widely-used datasets show that the model achieves significant improvements over other state-of-the-art methods.
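The dual matching idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual loss formulation, so everything here is an assumption made for illustration: the use of cosine similarity, the function names `cosine` and `dual_matching_loss`, the `weight` parameter, and the premise that image and text features already live in a shared embedding space. It only shows how a textual-visual term and a visual-visual term might be combined into one objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def dual_matching_loss(fake_img, text, real_img, weight=1.0):
    """Hypothetical dual matching objective (not the paper's formulation).

    fake_img, text, real_img: feature vectors assumed to share one
    embedding space. `weight` balances the two terms and is an
    illustrative parameter, not one taken from the source.
    """
    # Textual-visual term: generated image should match the description.
    textual_visual = 1.0 - cosine(fake_img, text)
    # Visual-visual term: generated image should match the real image.
    visual_visual = 1.0 - cosine(fake_img, real_img)
    return textual_visual + weight * visual_visual


if __name__ == "__main__":
    # Toy features: the generated image is close to the text and real image.
    print(dual_matching_loss([1.0, 0.1], [1.0, 0.0], [0.9, 0.2]))
```

Under these assumptions the loss is zero when the generated image features align perfectly with both the text features and the real-image features, and it grows as either alignment degrades.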

DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

Unified Vision-Language Pre-Training for Image Captioning and VQA

Unified Generative and Discriminative Training for Multi-modal Large Language Models

Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation

Multimodal Pre-training Method for Vision-language Understanding and Generation

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Write and Paint: Generative Vision-Language Models are Unified Modal Learners

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

VLIS: Unimodal Language Models Guide Multimodal Language Generation

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Cross-Modal Dual Learning for Sentence-to-Video Generation

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning