Image Synthesis from Locally Related Texts

Tianrui Niu,Fangxiang Feng,Lingxuan Li,Xiaojie Wang
DOI: https://doi.org/10.1145/3372278.3390684
2020-01-01
Abstract:Text-to-image synthesis refers to generating photo-realistic images from text descriptions. Recent works focus on generating images with complex scenes and multiple objects. However, the text inputs to these models are the only captions that always describe the most apparent object or feature of the image and detailed information (e.g. visual attributes) for regions and objects are often missing. Quantitative evaluation of generation performances is still an unsolved problem, where traditional image classification- or retrieval-based metrics fail at evaluating complex images. To address these problems, we propose to generate images conditioned on locally-related texts, i.e., descriptions of local image regions or objects instead of the whole image. Specifically, questions and answers (QAs) are chosen as locally-related texts, which makes it possible to use VQA accuracy as a new evaluation metric. The intuition is simple: higher image quality and image-text consistency (both globally and locally) can help a VQA model answer questions more correctly. We purposed VQA-GAN model with three key modules: hierarchical QA encoder, QA-conditional GAN and external VQA loss. These modules help leverage the new inputs effectively. Thorough experiments on two public VQA datasets demonstrate the effectiveness of the model and the newly proposed metric.
What problem does this paper attempt to address?