TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Xingsong Ye, Yongkun Du, Yunbo Tao, Zhineng Chen
2024-12-02
Abstract: Scene text recognition (STR) suffers from the challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained STR models. Meanwhile, despite producing holistically appealing text images, diffusion-based text image generation methods struggle to generate accurate and realistic instance-level text on a large scale. To tackle this, we introduce TextSSR: a novel framework for Synthesizing Scene Text Recognition data via a diffusion-based universal text region synthesis model. It ensures accuracy by focusing on generating text within a specified image region and leveraging rich glyph and position information to create a text region, which is less complex than the entire image. Furthermore, we utilize neighboring text within the region as a prompt to capture real-world font styles and layout patterns, guiding the generated text to resemble actual scenes. Finally, due to its prompt-free nature and capability for character-level synthesis, TextSSR scales well, and we construct an anagram-based TextSSR-F dataset of 0.4 million complex and realistic text instances. Experiments show that models trained on added TextSSR-F data exhibit better accuracy compared to models trained on 4 million existing synthetic samples. Moreover, its accuracy margin to models trained fully on a real-world dataset is less than 3.7%, confirming TextSSR's effectiveness and its great potential in scene text image synthesis. Our code is available at https://github.com/YesianRohn/TextSSR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two main problems in scene text recognition (STR):

1. **Synthetic training data is not realistic enough**: Existing synthetic data generation methods either rely on traditional rendering techniques, which produce text images with little diversity, or rely on diffusion models, which usually aim at holistically appealing text images (such as posters) and struggle to generate accurate and realistic instance-level text on a large scale.
2. **Sufficient high-quality real-world data is hard to collect**: Collecting a large number of high-quality real-scene text images is expensive and time-consuming, and low-frequency words are especially hard to obtain from real-world scenes. Synthesizing high-quality text images is therefore an effective alternative.

To solve these problems, the paper proposes the **TextSSR** framework, which generates training data for scene text recognition through a universal text region synthesis model based on diffusion. Specifically, the goals of TextSSR are:

- **Improve accuracy**: Generate text only within a specified image region, using rich glyph and position information, so that the generated text content is accurate.
- **Enhance realism**: Use neighboring text within the region as a prompt to capture real-world font styles and layout patterns, so that the generated text resembles actual scenes.
- **Achieve scalability**: Require no natural-language prompt, only the text position and the desired text content, and support character-level synthesis, so that data can be generated on a large scale.

Through these improvements, TextSSR generates accurate, realistic, and diverse scene text images, which in turn improves the STR models trained on them.
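To make the idea of "generating text within a specified image region" concrete, the following is a minimal NumPy sketch of how a target region could be turned into a binary position mask and a masked context image. It is an illustration only: the rectangular box format, the function names `region_mask` and `mask_out_region`, and the choice to blank the region with zeros are assumptions, not details taken from the paper or its code.

```python
import numpy as np


def region_mask(height, width, box):
    """Return a binary mask: 1 inside the target text region, 0 elsewhere.

    `box` is a hypothetical (x0, y0, x1, y1) rectangle in pixel
    coordinates, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def mask_out_region(image, mask):
    """Zero out the target region so only the surrounding context remains."""
    # Add a trailing channel axis so the (H, W) mask broadcasts over (H, W, C).
    return image * (1.0 - mask)[..., None]


# Toy example: a random 64x256 RGB "image" with a text region to fill in.
image = np.random.default_rng(0).random((64, 256, 3)).astype(np.float32)
mask = region_mask(64, 256, box=(32, 16, 160, 48))
masked_image = mask_out_region(image, mask)
```

In this sketch, the mask marks where text should appear and the masked image carries the surrounding context, including any neighboring text. How TextSSR actually encodes these inputs is not specified in this summary.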
Experimental results reported in the abstract show that models trained with added TextSSR-F data are more accurate than models trained on 4 million existing synthetic samples, and that their accuracy comes within 3.7% of models trained entirely on a real-world dataset.

### Formula Summary

The formulas involved in the paper are as follows:

- **VAE loss function**:
  \[ L_{\text{TextSSR-VAE}} = \| V_{\theta}(x) - x \|_{2}^{2} \]
- **CDM training loss function**:
  \[ L_{\text{TextSSR-CDM}} = \| \epsilon - \epsilon_{\theta}(z_{t}, t, Z_{M}, G, M, F_{g}) \|_{2}^{2} \]
- **Permutation formula**:
  \[ A_{L}^{L} = L! \]

The first two formulas describe the training objectives of the VAE and the CDM, respectively. The third gives the number of possible orderings of L characters.
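The formulas above can be checked numerically with a short sketch. Both losses are squared L2 distances, so one helper covers them; the arrays below are toy stand-ins for a reconstruction and for a predicted noise vector, not outputs of the actual VAE or CDM. The helper `count_distinct_anagrams` is an addition that is not in the source: it reflects the general fact that L! counts orderings of L distinct characters, and that a word with repeated letters yields fewer distinct strings.

```python
import math
from collections import Counter

import numpy as np


def squared_l2(a, b):
    """Squared L2 distance ||a - b||_2^2, the form shared by both losses."""
    return float(np.sum((np.asarray(a) - np.asarray(b)) ** 2))


def count_orderings(length):
    """A_L^L = L!: number of orderings of `length` distinct characters."""
    return math.factorial(length)


def count_distinct_anagrams(word):
    """Distinct rearrangements of `word`: L! divided by the factorial of
    each letter's multiplicity (not part of the source formulas)."""
    total = math.factorial(len(word))
    for multiplicity in Counter(word).values():
        total //= math.factorial(multiplicity)
    return total


# VAE-style reconstruction loss on a toy input and toy reconstruction.
x = np.array([1.0, 2.0, 3.0])
x_reconstructed = np.array([1.0, 2.5, 3.0])
vae_loss = squared_l2(x_reconstructed, x)  # 0.25

# CDM-style noise-prediction loss: true noise vs. a toy "predicted" noise.
eps = np.array([0.5, -1.0])
eps_predicted = np.array([0.5, 0.0])
cdm_loss = squared_l2(eps, eps_predicted)  # 1.0

print(count_orderings(5))                # 120
print(count_distinct_anagrams("hello"))  # 60, because the letter 'l' repeats
```

Because the count grows factorially with word length, even short words yield many character orderings, which is consistent with the abstract's point that character-level synthesis makes the approach scalable.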