Abstract: The crux of text-to-image synthesis stems from the difficulty of preserving the cross-modality semantic consistency between the input text and the synthesized image. Typical methods, which seek to model the text-to-image mapping directly, can only capture keywords in the text that indicate common objects or actions, and fail to learn their spatial distribution patterns. An effective way to circumvent this limitation is to generate an image layout as guidance, which a few methods have attempted. Nevertheless, these methods fail to generate practically effective layouts due to the diversity of input text and object locations. In this paper, we push for effective modeling in both text-to-layout generation and layout-to-image synthesis. Specifically, we formulate text-to-layout generation as a sequence-to-sequence modeling task, and build our model upon the Transformer to learn the spatial relationships between objects by modeling the sequential dependencies between them. In the layout-to-image synthesis stage, we focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesis process. To evaluate the quality of the generated layout, we design a new metric, dubbed Layout Quality Score, which considers both the absolute distribution errors of bounding boxes in the layout and the mutual spatial relationships between them. Extensive experiments on three datasets demonstrate the superior performance of our method over state-of-the-art methods in both predicting the layout and synthesizing the image from the given text.

What problem does this paper attempt to address?

The main problem this paper addresses is **cross-modal semantic consistency in text-to-image synthesis**. Existing methods that generate images directly from text capture only the keywords in the text and fail to learn how the corresponding objects are distributed spatially in the image, that is, the image layout. As a result, the generated images have implausible object layouts and incorrect spatial relationships between objects. To address this, the paper generates an image layout consistent with the input text as an intermediate step that guides the final image synthesis.

### Main Contributions

1. **A Transformer-based model that generates high-quality image layouts** consistent with the input text.
2. **A layout-to-image synthesizer** that combines the generated layout and the input text to generate high-quality images.
3. **A quantitative metric for layout quality** (the Layout Quality Score), which considers both the absolute distribution error of the bounding boxes in the layout and their spatial relationships with each other.
4. **Extensive experiments on three datasets**, showing that the proposed method outperforms existing methods in both predicting image layouts and synthesizing images from a given text.

### Method Overview

1. **Text-to-Layout Generation**:
   - The encoder of the Transformer encodes the input text and learns its latent representation.
   - The decoder generates the image layout. Layout generation is modeled as a sequence prediction task in which the category and location of each object are predicted step by step.
   - A joint classification strategy predicts the category and location of each object simultaneously to improve prediction accuracy.
2. **Layout-to-Image Synthesis**:
   - A text-aligned layout-to-image synthesizer (TALIS) combines the generated layout and the input text to generate the final image.
   - Layout information is incorporated into the generation process through a feature normalization module (ISLA-Norm) to keep the image consistent with the layout.
   - Text-visual semantic alignment is learned so that the generated image is semantically consistent with the input text.

### Experimental Results

The paper reports experiments on three datasets, COCO, COCO-Stuff, and LN-COCO. The results show that the proposed method outperforms existing methods in both generating layouts and synthesizing images.

### Formula Summary

- **Text Encoding**:
  \[ \{s_1, \ldots, s_{T_i}\} = F_e(\{e_1, \ldots, e_{T_i}\}) \]
  where \( T_i \) is the length of the input text and \( F_e \) is the transformation function of the text encoder.
- **Layout Decoding**:
  \[ h_t = F_d(S, O_{1:t-1}) \]
  where \( S = \{s_1, \ldots, s_{T_i}\} \) is the latent representation obtained from the text encoder, \( O_{1:t-1} \) is the sequence of previously predicted results, and \( F_d \) is the transformation function of the layout decoder.
- **Joint Classification**:
  \[ v_t = \arg\max_{i \in [1, S \times S \times C]} (p_i^t), \quad p_t = \text{Softmax}(M h_t) \]
  where \( M \) is a linear transformation, \( p_t \in \mathbb{R}^{S \times S \times C} \) is the classification probability, \( p_i^t \) is its \( i \)-th entry, and \( C \) is the total number of object categories. Here \( S \times S \) appears to count the grid cells of the image; it is presumably distinct from the text representation \( S \) used in the decoding step.
- **Fine-grained Regression**:
  \[ \{f_x^t, f_y^t, f_w^t, f_h^t\} = F_{\text{reg}}(h_t) \]
  where \( f_x^t, f_y^t \) are the coordinates of the center of the bounding box within the grid cell, and \( f_w^t, f_h^t \) are the width and height of the bounding box.
- **Loss Function**: \[ L_{\text{layout}} = L_{\tex
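
The joint classification and fine-grained regression steps from the formula summary can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes the image is divided into a `grid` × `grid` array of cells, that the flat class index is 0-based and ordered row-major over cells with the category varying fastest, that the center offsets lie in [0, 1] relative to the cell, and that width and height are already normalized to the image. The function names `decode_joint_index` and `decode_box` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_joint_index(v, grid, num_classes):
    """Split a flat index in [0, grid*grid*num_classes) into (row, col, category).

    Assumed ordering (not taken from the paper): row-major over grid cells,
    with the category index varying fastest.
    """
    if not 0 <= v < grid * grid * num_classes:
        raise ValueError("index out of range")
    cell, category = divmod(v, num_classes)
    row, col = divmod(cell, grid)
    return row, col, category


def decode_box(logits, offsets, grid, num_classes):
    """Return (category, (cx, cy, w, h)) with coordinates normalized to [0, 1].

    logits  : flat scores of length grid*grid*num_classes (the role of M h_t)
    offsets : (f_x, f_y, f_w, f_h); f_x, f_y are assumed to be offsets inside
              the selected cell, f_w, f_h sizes relative to the whole image
    """
    probs = softmax(logits)
    v = max(range(len(probs)), key=probs.__getitem__)  # argmax over p_t
    row, col, category = decode_joint_index(v, grid, num_classes)
    fx, fy, fw, fh = offsets
    cx = (col + fx) / grid  # coarse cell position plus fine offset
    cy = (row + fy) / grid
    return category, (cx, cy, fw, fh)
```

Under these assumptions, with a 4 × 4 grid and 3 categories, flat index 19 decodes to row 1, column 2, category 1. The classifier picks the coarse cell and the category in one step, and the regressed offsets then place the box center inside that cell.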

Layout-Bridging Text-to-Image Synthesis

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

End-to-End Text-to-Image Synthesis with Spatial Constrains

GOAL: Grounded Text-to-image Synthesis with Joint Layout Alignment Tuning

LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions

A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions

Enhancing Object Coherence in Layout-to-Image Synthesis

Training-free Composite Scene Generation for Layout-to-Image Synthesis

LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis

Layout2image: Image Generation from Layout

InstructLayout: Instruction-Driven 2D and 3D Layout Synthesis with Semantic Graph Prior

Layout Agnostic Scene Text Image Synthesis with Diffusion Models

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Smarttext: Learning to Generate Harmonious Textual Layout over Natural Image

Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs

Image Synthesis from Layout with Locality-Aware Mask Adaption

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

Aesthetic Text Logo Synthesis via Content-aware Layout Inferring

Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis