Expressive Text-to-Image Generation with Rich Text

Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang
2024-05-29
Abstract: Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on attention maps of a diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and maintain its fidelity against plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to address several key issues encountered when using plain text descriptions to generate target images in text-to-image generation:

1. **Precise Control**: Plain text struggles to accurately describe complex image details, such as specific color values and the importance of objects. This makes it difficult for users to precisely express their desired output.
2. **Complex Scene Description**: Writing detailed and complex text prompts is very cumbersome, and existing text encoders also face challenges in interpreting these complex prompts.
3. **Precise Specification of Colors and Styles**: Plain text cannot directly specify continuous quantities, such as specific RGB color values or the importance of each word.

To solve these problems, the authors propose a method using a rich text editor that supports multiple formats (such as font style, size, color, embedded images, and footnotes) to achieve finer control over text-to-image generation. Specifically, by extracting the attributes of each word, the method can achieve local style control, explicit token reweighting, precise color rendering, and detailed region composition. These functions are realized through a region-based diffusion process.
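
The two preparatory steps described above (reading each word's attributes from rich text, then giving each formatted word a region based on attention maps) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Span` class, its field names, and the functions `split_rich_text` and `assign_regions` are hypothetical names introduced here, and the per-pixel argmax over attention maps is a simplified stand-in for however the paper actually derives word regions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    """One run of rich text (hypothetical format); only `text` is required."""
    text: str
    color: Optional[Tuple[int, int, int]] = None  # exact RGB target
    font: Optional[str] = None                    # font style, mapped to a local image style
    size: float = 1.0                             # font size, used as a relative token weight
    footnote: Optional[str] = None                # detailed prompt for this word's region


def split_rich_text(spans: List[Span]) -> Tuple[str, List[Span]]:
    """Return the plain-text prompt and the spans that carry any formatting."""
    plain_prompt = "".join(span.text for span in spans)
    formatted = [
        s for s in spans
        if s.color is not None or s.font is not None
        or s.footnote is not None or s.size != 1.0
    ]
    return plain_prompt, formatted


def assign_regions(attention_maps: List[List[List[float]]]) -> List[List[int]]:
    """Label each pixel with the index of the map that is largest there.

    `attention_maps[k][y][x]` is assumed to hold the attention of formatted
    span k at pixel (y, x). Ties go to the lowest index.
    """
    height, width = len(attention_maps[0]), len(attention_maps[0][0])
    return [
        [max(range(len(attention_maps)), key=lambda k: attention_maps[k][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Example: a red church and a garden described in detail by a footnote.
spans = [
    Span("a "),
    Span("church", color=(255, 0, 0)),
    Span(" beside a "),
    Span("garden", footnote="a garden full of blooming roses"),
]
prompt, formatted = split_rich_text(spans)

# Toy 2x2 attention maps, one per formatted span (made-up numbers).
toy_maps = [
    [[0.9, 0.2], [0.6, 0.1]],  # "church"
    [[0.1, 0.8], [0.4, 0.9]],  # "garden"
]
regions = assign_regions(toy_maps)
```

In this sketch the plain prompt would drive the initial plain-text generation, while each entry of `formatted` would supply the region-specific prompt or guidance for the pixels labeled with its index in `regions`.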