Abstract: The crux of text-to-image synthesis lies in the difficulty of preserving cross-modality semantic consistency between the input text and the synthesized image. Typical methods, which seek to model the text-to-image mapping directly, can only capture keywords in the text that indicate common objects or actions, but fail to learn their spatial distribution patterns. An effective way to circumvent this limitation is to generate an image layout as guidance, which has been attempted by a few methods. Nevertheless, these methods fail to generate practically effective layouts due to the diversity of input text and object locations. In this paper, we push for effective modeling in both text-to-layout generation and layout-to-image synthesis. Specifically, we formulate text-to-layout generation as a sequence-to-sequence modeling task, and build our model upon the Transformer to learn the spatial relationships between objects by modeling the sequential dependencies between them. In the layout-to-image synthesis stage, we focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesis process. To evaluate the quality of the generated layout, we design a dedicated new metric, dubbed Layout Quality Score, which considers both the absolute distribution errors of bounding boxes in the layout and the mutual spatial relationships between them. Extensive experiments on three datasets demonstrate the superior performance of our method over state-of-the-art methods on both predicting the layout and synthesizing the image from the given text.
What problem does this paper attempt to address?
The main problem this paper addresses is **cross-modal semantic consistency in text-to-image synthesis**. Existing methods that generate images directly from text capture the keywords in the text but do not learn how the corresponding objects are distributed spatially in the image, that is, the image layout. As a result, the generated images often have implausible object layouts and incorrect spatial relationships between objects. To address this, the paper first generates an image layout consistent with the input text and then uses that layout as an intermediate step to guide the final image synthesis.
### Main Contributions
1. **A Transformer-based model for generating high-quality image layouts** that are consistent with the input text.
2. **A layout-to-image synthesizer** that combines the generated layout with the input text to produce high-quality images.
3. **A quantitative metric for evaluating the quality of generated layouts** (the Layout Quality Score), which considers both the absolute distribution error of the bounding boxes in the layout and their spatial relationships with each other.
4. **Extensive experiments on three datasets**, showing that the proposed method outperforms existing methods in both predicting image layouts and synthesizing images from a given text.
### Method Overview
1. **Text-to-Layout Generation**:
   - Encode the input text with the Transformer encoder to learn a latent representation of the text.
   - Generate the image layout with the decoder, treating layout generation as a sequence prediction task in which the category and location of each object are predicted one step at a time.
   - Adopt a joint classification strategy that predicts the category and location of an object simultaneously to improve prediction accuracy.
2. **Layout-to-Image Synthesis**:
   - Use a text-aligned layout-to-image synthesizer (TALIS), which combines the generated layout and the input text to generate the final image.
   - Incorporate layout information into the generation process through a feature normalization module (ISLA-Norm) to keep the image consistent with the layout.
   - Learn textual-visual semantic alignment so that the generated image is semantically consistent with the input text.
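The sequence-prediction view of layout generation can be illustrated with a minimal greedy decoding loop. This is a sketch under stated assumptions, not the paper's implementation: `step_fn` is a hypothetical stand-in for the layout decoder plus classifier, and the end-of-sequence token and the `max_objects` cap are assumed conventions that this summary does not specify.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step_fn: Callable[[Sequence, List[int]], int],
    text_states: Sequence,
    end_token: int,
    max_objects: int = 10,
) -> List[int]:
    """Predict layout tokens one at a time, each conditioned on the
    encoded text and on all previously predicted tokens."""
    outputs: List[int] = []
    for _ in range(max_objects):
        # step_fn is hypothetical: it stands in for decoder + classifier.
        token = step_fn(text_states, outputs)
        if token == end_token:  # assumed stop convention
            break
        outputs.append(token)
    return outputs


def toy_step(text_states: Sequence, previous: List[int]) -> int:
    """Toy stand-in that replays a fixed script instead of running a model."""
    script = [5, 7, -1]
    return script[len(previous)]


layout_tokens = greedy_decode(toy_step, ["s1", "s2"], end_token=-1)
print(layout_tokens)  # [5, 7]
```

Each returned token would correspond to one object in the layout; in a real model the toy script would be replaced by a forward pass through the decoder.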
### Experimental Results
The paper reports experiments on three datasets (COCO, COCO-stuff, and LN-COCO). The results show that the proposed method outperforms existing methods in both layout generation and image synthesis.
### Formula Summary
- **Text Encoding**:
\[
\{s_1, \ldots, s_{T_i}\} = F_e(\{e_1, \ldots, e_{T_i}\})
\]
where \( T_i \) is the length of the input text, and \( F_e \) is the transformation function of the text encoder.
- **Layout Decoding**:
\[
h_t = F_d(S, O_{1:t - 1})
\]
where \( S=\{s_1, \ldots, s_{T_i}\} \) is the latent representation obtained from the text encoder, \( O_{1:t - 1} \) is the sequence of outputs predicted in the previous steps, \( F_d \) is the transformation function of the layout decoder, and \( h_t \) is the decoder output at step \( t \).
- **Joint Classification**:
\[
v_t=\arg\max_{i \in [1, S\times S\times C]} (p_i^t), \quad p_t = \text{Softmax}(Mh_t)
\]
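where \( M \) is a linear transformation, \( p_t\in\mathbb{R}^{S\times S\times C} \) is the vector of classification probabilities at step \( t \) with \( i \)-th entry \( p_i^t \), and \( C \) is the total number of object categories. In this formula \( S\times S \) appears to denote the number of grid cells into which the image is divided, which is distinct from the latent representation \( S \) in the decoding formula, so each index \( v_t \) jointly identifies a location and a category.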
- **Fine-grained Regression**:
\[
\{f_x^t, f_y^t, f_w^t, f_h^t\} = F_{\text{reg}}(h_t)
\]
where \( f_x^t, f_y^t \) are the coordinates of the bounding-box center within the predicted grid cell, and \( f_w^t, f_h^t \) are the width and height of the bounding box.
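The joint classification and fine-grained regression steps can be combined into one decoded object, as in the sketch below. Several details are assumptions rather than facts from the paper: the flat index is taken to be 0-based and ordered row-major as (row, column, category), the offsets \( f_x, f_y \) are taken to lie in [0, 1] relative to the cell, and \( f_w, f_h \) are taken to be fractions of the image size. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_object(
    logits: Sequence[float],
    offsets: Tuple[float, float, float, float],
    grid_size: int,
    num_classes: int,
) -> Tuple[int, Tuple[float, float, float, float]]:
    """Turn one step's S*S*C logits and regressed offsets into
    (category, (center_x, center_y, width, height)) in image fractions."""
    probs = softmax(logits)
    # Joint classification: one argmax picks grid cell and category together.
    v = max(range(len(probs)), key=probs.__getitem__)
    cell, category = divmod(v, num_classes)  # assumed index ordering
    row, col = divmod(cell, grid_size)
    fx, fy, fw, fh = offsets
    # Fine-grained regression refines the center inside the chosen cell.
    center_x = (col + fx) / grid_size
    center_y = (row + fy) / grid_size
    return category, (center_x, center_y, fw, fh)


# Toy example: 4x4 grid, 3 categories, peak at row 1, column 2, category 1.
S, C = 4, 3
toy_logits = [0.0] * (S * S * C)
toy_logits[(1 * S + 2) * C + 1] = 5.0
category, box = decode_object(toy_logits, (0.5, 0.25, 0.3, 0.4), S, C)
print(category, box)  # 1 (0.625, 0.3125, 0.3, 0.4)
```

Predicting the cell and category with a single softmax keeps the coarse location a classification problem, while the regressed offsets recover the precision lost by discretizing the image into a grid.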
- **Loss Function**:
\[
L_{\text{layout}} = L_{\tex