TextDiffuser: Diffusion Models as Text Painters

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

2023-10-30

Abstract: Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty that existing diffusion models have in rendering accurate and coherent text inside generated images. Although diffusion models have made significant progress in image generation, the text they produce often fails to blend with the background, or looks unnatural against complex textures and lighting variations. There has also been no large-scale dataset built specifically for this task. To tackle these issues, the paper proposes TextDiffuser, a two-stage framework for generating images whose text is both legible and coherent with the background. In the first stage, a Transformer model generates a layout for the keywords extracted from the text prompt. In the second stage, a diffusion model generates the image conditioned on the text prompt and the generated layout. The paper also contributes MARIO-10M, a dataset of 10 million image-text pairs with OCR annotations (text recognition, detection, and character-level segmentation), and MARIO-Eval, a benchmark for assessing text rendering quality. Through experiments and user studies, the paper shows that TextDiffuser can generate high-quality text images from text prompts alone or together with text template images, and can perform text inpainting, i.e., reconstructing text in incomplete images. Based on these results, the paper reports that TextDiffuser outperforms existing methods at generating high-quality text images.
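The data flow of the two-stage design can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `KeywordBox`, `extract_keywords`, `placeholder_layout`, and `build_conditioning` are hypothetical, the assumption that keywords are the double-quoted spans of the prompt is made only for illustration, and a simple row-stacking rule stands in for the layout Transformer. The diffusion stage is not modeled; the sketch only assembles the prompt and layout that such a stage would be conditioned on.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class KeywordBox:
    """One keyword and the pixel box it should occupy (hypothetical type)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def extract_keywords(prompt: str) -> list[str]:
    # Assumption for illustration: text to render is wrapped in double quotes.
    return re.findall(r'"([^"]+)"', prompt)


def placeholder_layout(keywords: list[str], width: int = 512,
                       height: int = 512, margin: int = 32) -> list[KeywordBox]:
    # Stand-in for stage 1 (NOT a learned layout model):
    # stack one full-width box per keyword in equal-height rows.
    if not keywords:
        return []
    row_height = (height - 2 * margin) // len(keywords)
    boxes = []
    for i, word in enumerate(keywords):
        top = margin + i * row_height
        boxes.append(KeywordBox(word, margin, top, width - margin, top + row_height))
    return boxes


def build_conditioning(prompt: str) -> dict:
    # Inputs a stage-2 image generator would be conditioned on:
    # the full text prompt plus the keyword layout.
    return {"prompt": prompt, "layout": placeholder_layout(extract_keywords(prompt))}


if __name__ == "__main__":
    cond = build_conditioning('A poster that says "Hello" and "World"')
    for box in cond["layout"]:
        print(box)
```

Keeping the layout as an explicit intermediate structure is what makes the pipeline controllable in this sketch: the boxes could be replaced by ones taken from a template image without changing the prompt.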

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

DiffUTE: Universal Text Editing Diffusion Model

GlyphDiffusion: Text Generation as Image Generation

CustomText: Customized Textual Image Generation using Diffusion Models

Improving Diffusion Models for Scene Text Editing with Dual Encoders

ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

AnyText: Multilingual Visual Text Generation And Editing

Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion

Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

AltDiffusion: A Multilingual Text-to-Image Diffusion Model

TextCraftor: Your Text Encoder Can be Image Quality Controller

DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

iDesigner: A High-Resolution and Complex-Prompt Following Text-to-Image Diffusion Model for Interior Design

PAI-Diffusion: Constructing and Serving a Family of Open Chinese Diffusion Models for Text-to-image Synthesis on the Cloud

Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing