Abstract: Scene text recognition in low-resource languages frequently faces challenges due to the limited availability of training datasets derived from real-world scenes. This study proposes a novel approach that generates text images in low-resource languages by emulating the style of real text images from high-resource languages. Our approach utilizes a diffusion model that is conditioned on binary states: "synthetic" and "real." The training of this model involves dual translation tasks, where it transforms plain text images into either synthetic or real text images, based on the binary states. This approach not only effectively differentiates between the two domains but also facilitates the model's explicit recognition of characters in the target language. Furthermore, to enhance the accuracy and variety of generated text images, we introduce two guidance techniques: Fidelity-Diversity Balancing Guidance and Fidelity Enhancement Guidance. Our experimental results demonstrate that the text images generated by our proposed framework can significantly improve the performance of scene text recognition models for low-resource languages.

### What problem does this paper attempt to solve?

This paper addresses scene text recognition for low-resource languages, where training data drawn from real-world scenes is scarce. Existing methods usually rely on synthetic text images, but the domain gap between synthetic and real images leads to poor performance in practical applications.

To solve this problem, the authors propose a framework that generates text images for low-resource languages by imitating the style of real text images in high-resource languages. The core of the framework is a diffusion model conditioned on a binary state ("synthetic" or "real") and trained through Dual Translation Learning (DTL). To improve the accuracy and diversity of the generated text images, the authors also introduce two guidance techniques: Fidelity-Diversity Balancing Guidance (FDB Guidance) and Fidelity Enhancement Guidance (FE Guidance).

### Main contributions

1. **New text image generation framework**: The framework uses a diffusion model with Dual Translation Learning and binary-state conditioning, which can effectively imitate the style of real text images and accurately understand and render characters in the target language.
2. **Two guidance techniques**: FDB Guidance and FE Guidance significantly improve the accuracy and diversity of the generated text images.
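The binary-state conditioning described above can be illustrated with a minimal sketch of how training samples for the two translation tasks might be labeled. This is not the authors' implementation: the `TranslationSample` class, the `build_dual_translation_samples` helper, the file names, and the way plain/target pairs are obtained are all hypothetical, since the summary does not spell out how the pairs are assembled.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical encoding of the binary state that conditions the diffusion model.
SYNTHETIC = 0
REAL = 1


@dataclass
class TranslationSample:
    plain_image: str   # identifier of the plain text image (translation source)
    target_image: str  # identifier of the synthetic or real text image (translation target)
    state: int         # SYNTHETIC or REAL


def build_dual_translation_samples(
    synthetic_pairs: List[Tuple[str, str]],
    real_pairs: List[Tuple[str, str]],
) -> List[TranslationSample]:
    """Tag each (plain, target) pair with the state of its target domain."""
    samples = [TranslationSample(p, t, SYNTHETIC) for p, t in synthetic_pairs]
    samples += [TranslationSample(p, t, REAL) for p, t in real_pairs]
    return samples


# Illustrative file names only; they do not come from the paper.
demo = build_dual_translation_samples(
    synthetic_pairs=[("plain_001.png", "synth_001.png")],
    real_pairs=[("plain_002.png", "real_002.png")],
)
for sample in demo:
    print(sample)
```

Under this assumed layout, a single model sees both tasks during training, and the `state` field is the only signal that tells it which target domain to produce.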
### Formula representation

The forward and reverse processes of the diffusion model are represented by the following formulas:

- Forward process:
  \[ q(x_t|x_{t-1})=\mathcal{N}(\sqrt{1-\beta_t}\,x_{t-1},\beta_t I) \]
  \[ q(x_{1:T}|x_0)=\prod_{t=1}^{T} q(x_t|x_{t-1}) \]
- Reverse process:
  \[ p_\theta(x_{t-1}|x_t)=\mathcal{N}(\mu_\theta(x_t,t),\Sigma_\theta(x_t,t)) \]
- Classifier-free guidance:
  \[ \tilde{\epsilon}_\theta(x_t,t,c,y)=\epsilon_\theta(x_t,t,c,y)+w\left(\epsilon_\theta(x_t,t,c,y)-\epsilon_\theta(x_t,t)\right) \]
  where \(w\) represents the guidance scale.
- Scheduling of FDB guidance:
  \[ w_t=\left(\frac{t}{T}\right)w_{\text{min}}+\left(1-\frac{t}{T}\right)w_{\text{max}} \]

Through these methods, this research significantly improves the performance of scene text recognition models for low-resource languages.
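The formulas above can be checked numerically with a small pure-Python sketch. It treats images and noise predictions as flat lists of floats and only mirrors the three expressions as written (one forward step, the classifier-free guidance combination, and the linear FDB schedule). The function names and the example values for `beta_t`, `w_min`, `w_max`, and `T` are assumptions made for illustration, not settings from the paper.

```python
import math
import random


def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I), element by element."""
    scale = math.sqrt(1.0 - beta_t)
    sigma = math.sqrt(beta_t)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]


def cfg_combine(eps_cond, eps_uncond, w):
    """Classifier-free guidance: eps_cond + w * (eps_cond - eps_uncond)."""
    return [c + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def fdb_scale(t, T, w_min, w_max):
    """Linear schedule w_t = (t / T) * w_min + (1 - t / T) * w_max."""
    frac = t / T
    return frac * w_min + (1.0 - frac) * w_max


# Hypothetical values, chosen only to exercise the functions.
rng = random.Random(0)
x1 = forward_step([0.5, -0.25, 1.0], beta_t=0.02, rng=rng)
print("x_1:", x1)

T = 1000
for t in (T, T // 2, 0):
    w_t = fdb_scale(t, T, w_min=1.0, w_max=5.0)
    guided = cfg_combine([0.2, -0.1], [0.1, 0.0], w_t)
    print(f"t={t:4d}  w_t={w_t:.2f}  guided={guided}")
```

With this schedule the guidance scale equals `w_min` at `t = T`, where sampling starts from noise, and rises linearly to `w_max` as `t` approaches 0.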

Text Image Generation for Low-Resource Languages with Dual Translation Learning
