R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis Via Generative Adversarial Networks
Yanyuan Qiao, Qi Chen, Chaorui Deng, Ning Ding, Yuankai Qi, Mingkui Tan, Xincheng Ren, Qi Wu
DOI: https://doi.org/10.1145/3474085.3475363
2021-01-01
Abstract: Despite significant recent progress in generative models, context-rich text-to-image synthesis depicting multiple complex objects remains non-trivial. The main challenges lie in the ambiguous semantics of a complex description and the intricate scene of an image containing various objects with different positional relationships and diverse appearances. To address these challenges, we propose R-GAN, which generates reasonable images from the given text in a human-like way. Specifically, just as humans first find and settle the essential elements to create a simple sketch, we first capture a monolithic-structural text representation by building a scene graph to find the essential semantic elements. Then, based on this representation, we design a bounding box generator that estimates the layout, i.e., the position and size of the target objects, followed by a shape generator that draws a fine-detailed shape for each object. Unlike previous work, which only generates coarse shapes blindly, we introduce a coarse-to-fine shape generator based on a shape knowledge base. Finally, to complete the image synthesis, we propose a multi-modal geometry-aware spatially-adaptive generator conditioned on the monolithic-structural text representation and the geometry-aware map of the shapes. Extensive experiments on the real-world dataset MSCOCO show the superiority of our method in terms of both quantitative and qualitative metrics.
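
The data flow described in the abstract (text, then scene graph, then layout, then a per-object map) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Relation`, `SceneGraph`, `Box`, `placeholder_layout`, and `rasterize` are hypothetical, the layout function is a trivial stand-in for the learned bounding box generator, and the rasterizer fills whole boxes rather than drawing fine-detailed shapes. It only shows what kind of data each stage might pass to the next.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) edge of a scene graph."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Objects are nodes; relations are directed, labeled edges."""
    objects: List[str] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        # Register each object once, in order of first mention.
        for name in (subject, obj):
            if name not in self.objects:
                self.objects.append(name)
        self.relations.append(Relation(subject, predicate, obj))


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized [0, 1] image coordinates."""
    label: str
    x: float
    y: float
    w: float
    h: float


def placeholder_layout(graph: SceneGraph) -> List[Box]:
    """Hypothetical stand-in for a learned bounding box generator:
    assigns each object an equal-width vertical strip."""
    n = len(graph.objects)
    return [Box(label, i / n, 0.0, 1.0 / n, 1.0)
            for i, label in enumerate(graph.objects)]


def rasterize(boxes: List[Box], size: int = 8) -> List[List[int]]:
    """Paint box k (1-based) into a size x size grid; 0 is background.
    A real shape stage would draw object shapes, not filled boxes."""
    grid = [[0] * size for _ in range(size)]
    for idx, b in enumerate(boxes, start=1):
        x0, x1 = round(b.x * size), round((b.x + b.w) * size)
        y0, y1 = round(b.y * size), round((b.y + b.h) * size)
        for r in range(y0, y1):
            for c in range(x0, x1):
                grid[r][c] = idx
    return grid


# Usage: "a man riding a horse on grass" as two relations.
graph = SceneGraph()
graph.add_relation("man", "riding", "horse")
graph.add_relation("horse", "on", "grass")
label_map = rasterize(placeholder_layout(graph), size=6)
```

In this sketch the resulting `label_map` plays the role of a spatial conditioning input for an image generator; the learned components the abstract names would replace both placeholder functions.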