TextDiffuser: Diffusion Models as Text Painters

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
2023-10-30
Abstract: Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty existing diffusion models have in rendering accurate and coherent text inside generated images. Although diffusion models have made significant progress in image generation, the text they produce may not harmonize with the background or may look unnatural against complex textures or lighting variations. There is also a lack of large-scale datasets designed for this task. To tackle these issues, the paper proposes TextDiffuser, a two-stage framework for generating images whose text is high-quality and coherent with the background. In the first stage, a Transformer model generates a layout for the keywords extracted from the text prompt; in the second stage, a diffusion model generates the image conditioned on both the text prompt and the generated layout. The paper also contributes MARIO-10M, a large-scale dataset of 10 million image-text pairs with OCR annotations (text recognition, detection, and character-level segmentation), and MARIO-Eval, a benchmark for comprehensive evaluation of text rendering quality. Through experiments and user studies, the paper shows that TextDiffuser can flexibly generate high-quality text images from text prompts alone or together with text template images, and can perform text inpainting, i.e., reconstructing text in incomplete images. According to this summary, the results indicate that TextDiffuser outperforms existing methods at generating high-quality text images.
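
The two-stage flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the quote-based keyword extraction, the `Box` type, and the functions `extract_keywords`, `toy_layout`, and `build_condition` are all hypothetical names introduced here, and the row-stacking rule is only a stand-in for the learned layout Transformer. The sketch shows the data handed from stage one to stage two, namely a prompt paired with one bounding box per keyword.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Box:
    """Axis-aligned box in pixel coordinates for one keyword (hypothetical type)."""
    keyword: str
    left: int
    top: int
    right: int
    bottom: int


def extract_keywords(prompt: str) -> List[str]:
    """Assumption: the text to render is wrapped in matching single or double quotes."""
    return [m.group(2) for m in re.finditer(r"(['\"])(.+?)\1", prompt)]


def toy_layout(keywords: List[str], width: int = 512, height: int = 512,
               margin: int = 32) -> List[Box]:
    """Stand-in for stage one: stack keywords in equal-height rows.

    A real layout model would predict these boxes; here they are fixed by rule.
    """
    if not keywords:
        return []
    row_h = (height - 2 * margin) // len(keywords)
    return [
        Box(kw, margin, margin + i * row_h, width - margin, margin + (i + 1) * row_h)
        for i, kw in enumerate(keywords)
    ]


def build_condition(prompt: str) -> Dict[str, object]:
    """Bundle what stage two (the diffusion model) would be conditioned on."""
    return {"prompt": prompt, "layout": toy_layout(extract_keywords(prompt))}


if __name__ == "__main__":
    cond = build_condition('a poster with the words "Hello" and "World"')
    for box in cond["layout"]:
        print(box)
```

In an actual system, the output of `build_condition` would be passed to a diffusion model that accepts layout conditioning; that step is omitted here because the summary does not specify its interface.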