Locally controllable network based on visual–linguistic relation alignment for text-to-image generation

Zaike Li, Li Liu, Huaxiang Zhang, Dongmei Liu, Yu Song, Boqun Li
DOI: https://doi.org/10.1007/s00530-023-01222-7
IF: 3.9
2024-01-20
Multimedia Systems
Abstract: Existing locally controllable text-to-image generation methods do not achieve satisfactory results in fine detail. To address this, a novel locally controllable text-to-image generation network based on visual–linguistic relation alignment is proposed. The method aims to perform image processing and generation semantically under text guidance, exploiting the relationship between text and image to achieve local control over generation. Visual–linguistic matching learns similarity weights between image and text from semantic features, yielding a fine-grained correspondence between local image regions and words. An instance-level optimization function is introduced into the generation process to accurately control the weights with low similarity, which are combined with text features to generate new visual attributes. In addition, a local control loss is proposed to preserve the details of the text and of local regions of the image. Extensive experiments demonstrate the superior performance of the proposed method, which enables more accurate control of the original image.
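The word-to-region matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the normalization, so cosine similarity followed by a softmax is assumed here, and the function name `word_region_weights`, the `temperature` parameter, and the random feature arrays are all hypothetical.

```python
import numpy as np

def word_region_weights(regions, words, temperature=1.0):
    """Illustrative word-to-region matching (assumed form, not from the paper).

    regions: (R, D) array of local image-region features.
    words:   (T, D) array of word features.
    Returns the (T, R) cosine similarities and the (T, R) softmax weights,
    where each row of the weights sums to 1.
    """
    # L2-normalize so the dot product equals cosine similarity.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T  # (T, R)

    # Numerically stable softmax over regions, for each word.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return sim, weights

# Hypothetical features: 4 image regions and 3 words in an 8-dim space.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))
words = rng.normal(size=(3, 8))
sim, weights = word_region_weights(regions, words)
```

Under these assumptions, a word-region pair with a low similarity score would mark a region that the text does not yet describe well, which is the kind of weight the abstract says the instance-level optimization function controls.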
computer science, information systems, theory & methods