Abstract: Scene text recognition (STR) suffers from the challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information to create the less complex text region compared to the entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR enjoys a wonderful scalability and we construct an anagram-based TextSSR-F dataset with 0.4 million text instances with complexity and realism. Experiments show that models trained on added TextSSR-F data exhibit better accuracy compared to models trained on 4 million existing synthetic data. Moreover, its accuracy margin to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential in scene text image synthesis. Our code is available at https://github.com/YesianRohn/TextSSR.

What problem does this paper attempt to address?

This paper addresses two main problems in scene text recognition (STR):

1. **Synthetic training data is not realistic enough**: Existing synthetic data generation methods either use traditional rendering techniques, which produce text images that lack diversity, or rely on diffusion models, which usually focus on holistically appealing text images (such as posters) and struggle to generate accurate, realistic instance-level text on a large scale.
2. **Sufficient high-quality real-world data is hard to collect**: Collecting a large number of high-quality real-scene text images is expensive and time-consuming, and low-frequency words in real-world scenarios are hard to obtain. Synthesizing high-quality text images is therefore an effective alternative.

To solve these problems, the paper proposes the **TextSSR** framework, which generates training data for scene text recognition through a diffusion-based universal text region synthesis model. Specifically, the goals of TextSSR are:

- **Improve accuracy**: By generating text only within a specified image region and using rich glyph and position information, it keeps the generated text content accurate.
- **Enhance realism**: It uses neighboring text within the region as a prompt to capture real-world font styles and layout patterns, so that the generated text resembles actual scenes.
- **Achieve scalability**: It needs no natural-language prompt, only the text position and the desired text content, and it supports character-level synthesis, so it can generate data on a large scale.

Through these improvements, TextSSR can generate high-quality, diverse scene text images that improve the effectiveness of trained STR models.
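The prompt-free interface described above, where only a text position and the desired content are specified, can be pictured as a small data structure. The following is an illustrative sketch, not the paper's code: `RegionRequest`, its field names, the axis-aligned box, and the binary mask representation are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionRequest:
    """Hypothetical prompt-free request: what to write and where to write it."""
    text: str
    box: tuple  # (left, top, right, bottom) in pixels; right and bottom are exclusive


def box_mask(request, height, width):
    """Return a height x width binary mask with 1s inside the requested box."""
    left, top, right, bottom = request.box
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the image")
    return [
        [1 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


# No natural-language prompt is involved: only the content and the position.
request = RegionRequest(text="OPEN", box=(1, 1, 4, 3))
mask = box_mask(request, height=4, width=6)
```

In this sketch the request carries no description of the scene, which is what would make it cheap to generate many requests programmatically.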
Experimental results show that models trained with added TextSSR-F data achieve better accuracy than models trained on 4 million existing synthetic samples, and that their accuracy margin to models trained fully on a real-world dataset is less than 3.7%.

### Formula Summary

The formulas involved in the paper are as follows:

- **VAE loss function**:
  \[ L_{\text{TextSSR-VAE}} = \|V_{\theta}(x) - x\|_{2}^{2} \]
- **CDM training loss function**:
  \[ L_{\text{TextSSR-CDM}} = \|\epsilon - \epsilon_{\theta}(z_{t}, t, Z_{M}, G, M, F_{g})\|_{2}^{2} \]
- **Permutation formula**:
  \[ A_{L}^{L} = L! \]

The first two formulas describe the training objectives of the VAE and the CDM, respectively. The third gives the number of ways to arrange \(L\) characters, that is, the number of possible character permutations.
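As a rough illustration of these formulas, the Python sketch below computes a squared L2 error of the form shared by both losses and counts character orderings for a word. The function names are hypothetical and the short lists are toy stand-ins for images and noise tensors. The distinct-anagram count for words with repeated letters is a standard combinatorial refinement and is not stated in the summary above.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def squared_l2(pred, target):
    """Squared L2 norm ||pred - target||_2^2 of two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def ordering_count(word):
    """A_L^L = L!: number of orderings of L character positions."""
    return factorial(len(word))


def distinct_anagram_count(word):
    """L! divided by the factorial of each letter's multiplicity."""
    count = factorial(len(word))
    for multiplicity in Counter(word).values():
        count //= factorial(multiplicity)
    return count


def distinct_anagrams(word):
    """Enumerate the distinct anagrams of a word (practical only for short words)."""
    return sorted({"".join(p) for p in permutations(word)})
```

For a word with no repeated letters the two counts agree. With repeated letters, `L!` counts some identical strings more than once, so the number of distinct anagrams is smaller.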

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition
