Visual Text Generation in the Wild

Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang
2024-07-19
Abstract: Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of visual text generation in real-world scenarios, particularly generating high-quality, coherent, and practical text images. Specifically, it introduces a new method called SceneVTG, which aims to overcome the limitations of existing rendering-based and diffusion-based approaches that struggle to achieve fidelity, coherence, and practicality at the same time.

- **Fidelity**: The generated text images should blend seamlessly with the background without obvious artifacts, and the text content should be completely consistent with the given conditions, without any misspelled or extraneous text.
- **Coherence** (called reasonability in the abstract): The generated text areas and content should be in harmony with the image context, avoiding meaningless text.
- **Practicality** (called utility in the abstract): The generated images should enhance the performance of related tasks (such as text detection and recognition).

SceneVTG adopts a two-stage paradigm. It first uses a Multimodal Large Language Model (MLLM) to recommend reasonable text areas and content across multiple scales and levels, and then employs a conditional diffusion model to generate text images based on these conditions. Experimental results show that SceneVTG significantly outperforms traditional rendering-based and recent diffusion-based methods in terms of fidelity and coherence, while the generated images offer better practicality for tasks involving text detection and recognition.

Furthermore, the paper contributes a new dataset called SceneVTG-Erase, which contains 155K scene text images and their backgrounds with the text erased, along with detailed OCR annotations for training models.
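The two-stage paradigm described above can be pictured as a simple data flow: a recommender proposes text regions and contents, and a renderer generates the image conditioned on them. The following is a minimal sketch of that flow only. Every name in it (`TextProposal`, `Image`, `generate_scene_text`, `toy_recommend`, `toy_render`) is hypothetical and not taken from the SceneVTG code, and the toy functions merely stand in for the MLLM and the conditional diffusion model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical and only illustrate the data flow;
# they are not taken from the SceneVTG code base.

Polygon = List[Tuple[float, float]]  # text region as a list of (x, y) vertices


@dataclass
class TextProposal:
    region: Polygon  # where the text should go
    content: str     # what the text should say


@dataclass
class Image:
    width: int
    height: int
    # Proposals drawn so far; a stand-in for actual pixel data.
    rendered: Tuple[TextProposal, ...] = ()


def generate_scene_text(
    background: Image,
    recommend: Callable[[Image], List[TextProposal]],
    render: Callable[[Image, List[TextProposal]], Image],
) -> Image:
    """Stage 1 proposes regions and contents; stage 2 renders them."""
    proposals = recommend(background)     # stage 1: stand-in for the MLLM
    return render(background, proposals)  # stage 2: stand-in for the diffusion model


def toy_recommend(image: Image) -> List[TextProposal]:
    # Toy stand-in: one fixed box in the upper-left corner of the image.
    w, h = image.width / 2, image.height / 4
    return [TextProposal(region=[(0, 0), (w, 0), (w, h), (0, h)], content="OPEN")]


def toy_render(image: Image, proposals: List[TextProposal]) -> Image:
    # Toy stand-in: record the proposals instead of synthesizing pixels.
    return Image(image.width, image.height, image.rendered + tuple(proposals))


result = generate_scene_text(Image(640, 480), toy_recommend, toy_render)
print([p.content for p in result.rendered])
```

Passing the two stages in as callables keeps the sketch independent of any particular model: a real recommender or generator could replace the toy functions without changing `generate_scene_text`.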
Through comparative experiments, the paper demonstrates the advantages of SceneVTG in generating images with high fidelity, coherence, and practicality, especially in the generation of small-sized text and curved multi-scale text distributions, as well as its application effectiveness in text detection and recognition tasks.