Abstract: Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, images generated by stable diffusion models suffer from deteriorated details. This pitfall impacts image editing tasks that require information preservation, e.g., scene text editing. As a desired result, the model must be able to replace the text in the source image with the target text while preserving details, e.g., color, font size, and background. To leverage the potential of diffusion models, in this work, we introduce the Diffusion-BasEd Scene Text manipulation Network, called DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, then provide extensive ablation studies for each granularity to analyze our performance gain. We also demonstrate the effectiveness of our proposed method in synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% on the COCO-text and ICDAR2013 datasets, respectively, for character-level evaluation.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses text editing in natural scenes, particularly the challenge of replacing text in images while preserving details such as color, font size, and background. Specifically, the paper proposes a diffusion model-based approach called **DBEST (Diffusion-Based Scene Text Manipulation Network)** to improve on the performance of existing methods when editing text in complex scenes.

#### Main Contributions

1. **Identifying the limitations of existing diffusion models**: Current image editing methods based on diffusion models perform poorly when editing text in natural scenes, especially in preserving details.
2. **Proposing a new diffusion model**: By employing a technique similar to text inversion, a diffusion model suitable for scene text editing is proposed. This model generates images without modifying the original latent diffusion model (LDM), making it easier to integrate with other text-related pipelines.
3. **Synthetic dataset**: As part of the training strategy, a synthetic dataset is introduced to avoid the cost of collecting real scene text data.
4. **One-shot style adaptation**: A one-shot style adaptation strategy is proposed to maintain the source style during the editing process.
5. **Text recognition guidance**: Utilizing a text recognition model for classifier guidance significantly improves performance both qualitatively and quantitatively. On the SynText dataset, the proposed network achieved an OCR word accuracy of 84.83%, a 13% improvement over the best existing method.

#### Experimental Results

- On the COCO-Text and ICDAR2013 datasets, the method achieved character-level evaluation accuracies of 94.15% and 98.12%, respectively.
- Compared to existing methods, this approach excels in image quality (PSNR, SSIM, LPIPS) and OCR accuracy.
- A series of ablation experiments validated the effectiveness of each component, particularly the importance of the synthetic dataset and text recognition guidance.

Through these contributions, the paper demonstrates the superior performance of DBEST in natural scene text editing and provides new directions for research in this field.
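
The summary reports OCR accuracy at both the word level and the character level but does not say how either is computed. The sketch below shows one common way such scores are defined, as an illustration only: it assumes word accuracy is the exact-match rate over recognized strings and character-level accuracy is one minus the edit distance normalized by the total target length. The function names (`edit_distance`, `word_accuracy`, `char_accuracy`) and the sample strings are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of OCR outputs that match the target text exactly (assumed definition)."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and of equal length")
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def char_accuracy(preds: Sequence[str], targets: Sequence[str]) -> float:
    """One minus total edit distance over total target characters (assumed definition)."""
    if len(preds) != len(targets) or not targets:
        raise ValueError("preds and targets must be non-empty and of equal length")
    total_chars = sum(len(t) for t in targets)
    if total_chars == 0:
        raise ValueError("targets must contain at least one character")
    # Cap each sample's errors at its target length so the score stays in [0, 1].
    errors = sum(min(edit_distance(p, t), len(t)) for p, t in zip(preds, targets))
    return 1.0 - errors / total_chars


if __name__ == "__main__":
    # Hypothetical OCR readings of edited images versus the requested target text.
    targets = ["hello", "world", "text"]
    preds = ["hello", "w0rld", "text"]
    print(f"word accuracy: {word_accuracy(preds, targets):.4f}")
    print(f"char accuracy: {char_accuracy(preds, targets):.4f}")
```

Under these assumed definitions, a single wrong character makes the whole word count as a miss at the word level but costs only one character at the character level, which is one reason character-level scores tend to be higher than word-level scores on the same outputs.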

On Manipulating Scene Text in the Wild with Diffusion Models

Improving Diffusion Models for Scene Text Editing with Dual Encoders

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles

Layout Agnostic Scene Text Image Synthesis with Diffusion Models

Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models

DiffSTR: Controlled Diffusion Models for Scene Text Removal

ECNet: Effective Controllable Text-to-Image Diffusion Models

From Text to Pose to Image: Improving Diffusion Model Control and Quality

TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

DiffusionSTR: Diffusion Model for Scene Text Recognition

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

TextDiffuser: Diffusion Models as Text Painters

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation