Abstract: Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and their contents should be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images should facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.

What problem does this paper attempt to address?

This paper addresses the challenge of visual text generation in real-world scenarios: producing text images that are high-quality, coherent with the scene, and useful in practice. It introduces a method called SceneVTG, which aims to overcome the limitations of existing rendering-based and diffusion-based approaches. These approaches struggle to satisfy the following three criteria simultaneously:

- **Fidelity**: The generated text should blend seamlessly with the background without obvious artifacts, and the text content should be completely consistent with the given conditions, without any misspelled or extraneous text.
- **Coherence** (called "Reasonability" in the abstract): The generated text areas and content should be in harmony with the image context, avoiding meaningless text.
- **Practicality** (called "Utility" in the abstract): The generated images should improve the performance of related tasks, such as text detection and recognition.

SceneVTG adopts a two-stage paradigm. It first uses a Multimodal Large Language Model (MLLM) to recommend reasonable text areas and content across multiple scales and levels, and then employs a conditional diffusion model to generate text images based on these conditions. Experimental results show that SceneVTG significantly outperforms traditional rendering-based and recent diffusion-based methods in terms of fidelity and coherence, while the generated images offer better practicality for tasks involving text detection and recognition.

The paper also contributes a new dataset called SceneVTG-Erase, which contains 155K scene text images and their backgrounds with the text erased, along with detailed OCR annotations for training models.

Through comparative experiments, the paper demonstrates the advantages of SceneVTG in fidelity, coherence, and practicality, especially for small-sized text and for curved, multi-scale text distributions, as well as its effectiveness when the generated images are applied to text detection and recognition tasks.
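The two-stage data flow described in this summary can be pictured with a minimal Python sketch. It only illustrates how the output of a recommendation stage becomes the condition of a generation stage; it is not the SceneVTG implementation. All names here (`TextProposal`, `generate_text_image`, `dummy_recommend`, `dummy_render`) are hypothetical, and the two stand-in functions replace the MLLM and the conditional diffusion model with trivial placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Vertices of a text region in pixel coordinates (an assumed representation).
Polygon = List[Tuple[int, int]]


@dataclass
class TextProposal:
    """One recommended text instance: where to write and what to write."""
    region: Polygon
    content: str


# Hypothetical stage interfaces; neither signature comes from the paper.
Recommender = Callable[[Any], List[TextProposal]]   # stage 1: an MLLM in the paper
Renderer = Callable[[Any, List[TextProposal]], Any]  # stage 2: a conditional diffusion model in the paper


def generate_text_image(background: Any, recommend: Recommender, render: Renderer):
    """Run stage 1, then pass its proposals to stage 2 as conditions."""
    proposals = recommend(background)
    image = render(background, proposals)
    return image, proposals


def dummy_recommend(background: Any) -> List[TextProposal]:
    # Placeholder for the recommendation stage: always proposes one fixed box.
    box = [(10, 10), (110, 10), (110, 40), (10, 40)]
    return [TextProposal(region=box, content="OPEN")]


def dummy_render(background: Any, proposals: List[TextProposal]) -> dict:
    # Placeholder for the generation stage: records what would be drawn.
    return {"background": background, "texts": [p.content for p in proposals]}


image, proposals = generate_text_image("street.png", dummy_recommend, dummy_render)
```

Keeping the two stages behind separate callables mirrors the separation in the summary: the recommendation stage decides where and what to write, and the generation stage only consumes those decisions as conditions.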

Visual Text Generation in the Wild
