Abstract: Text-to-image generation produces cross-modal data with consistent content by mining the semantic consistency shared between the text and image modalities. Because of the differences between the two modalities, the task faces many difficulties and challenges. In this paper, we propose to boost text-to-image synthesis through adaptive learning and generating generative adversarial networks (ALG-GANs). First, we propose an adaptive forgetting mechanism in the generator to reduce error accumulation and learn knowledge flexibly in the cascade structure. Second, to avoid the mode collapse caused by strongly biased supervision, we propose a multi-task discriminator that uses weak-supervision information to guide the generator more comprehensively and maintain semantic consistency throughout the cascade generation process. Third, to avoid the refinement difficulty caused by bad initialization, we judge the quality of the initialization before further processing: the generator re-samples the noise and re-initializes bad initializations to obtain good ones. All of these contributions are integrated in a unified framework, a text-to-image synthesis method based on adaptive forgetting, drafting and comprehensive guiding with hierarchical generative adversarial networks. The model is evaluated on the Caltech-UCSD Birds 200 (CUB) dataset and the Oxford 102 Category Flowers (Oxford) dataset with standard metrics. The results on Inception Score (IS) and Fréchet Inception Distance (FID) show that our model outperforms previous methods.
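
The drafting step described in the abstract (judge the initialization, then re-sample the noise if it is bad) can be illustrated with a minimal sketch. This is not the implementation from the paper: the function name `draft_with_resampling`, the callables `generate_draft` and `score_draft`, and the parameters `threshold` and `max_tries` are all hypothetical, and the abstract does not say how quality is judged or how many retries are allowed.

```python
import random


def draft_with_resampling(generate_draft, score_draft, noise_dim=8,
                          threshold=0.5, max_tries=5, seed=0):
    """Re-sample noise until a draft passes a quality check (illustrative only).

    generate_draft: hypothetical stand-in for the first-stage generator,
                    maps a noise vector to an initial image.
    score_draft:    hypothetical stand-in for the quality judge,
                    maps a draft to a score where higher is better.
    Returns (best draft seen, its score, number of attempts made).
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    rng = random.Random(seed)
    best_draft, best_score = None, float("-inf")
    for attempt in range(1, max_tries + 1):
        # Draw a fresh noise vector for every attempt.
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
        draft = generate_draft(noise)
        score = score_draft(draft)
        # Remember the best draft in case none reaches the threshold.
        if score > best_score:
            best_draft, best_score = draft, score
        # Assumed stopping rule: accept the first draft that is good enough.
        if score >= threshold:
            break
    return best_draft, best_score, attempt
```

In this sketch a capped retry count and a best-so-far fallback are design assumptions that keep the loop from running forever when no draft passes; only drafts accepted here would be handed to the later refinement stages.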

CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

CogView: Mastering Text-to-Image Generation via Transformers

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Emage: Non-Autoregressive Text-to-Image Generation

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

BENet: bi-directional enhanced network for image captioning

Neural Architecture Search with a Lightweight Transformer for Text-to-Image Synthesis

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers

Controllable Textual Inversion for Personalized Text-to-Image Generation

GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features

SAW-GAN: Multi-granularity Text Fusion Generative Adversarial Networks for text-to-image generation

Adaptive Forgetting, Drafting and Comprehensive Guiding: Text-to-Image Synthesis with Hierarchical Generative Adversarial Networks

Show, tell and rectify: Boost image caption generation via an output rectifier