On Manipulating Scene Text in the Wild with Diffusion Models

Joshua Santoso, Christian Simon, Williem
2023-11-03
Abstract: Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, one might notice that images generated by stable diffusion models suffer from deteriorated details. This pitfall impacts image editing tasks that require information preservation, e.g., scene text editing. Ideally, the model must be able to replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work, we introduce the Diffusion-BasEd Scene Text manipulation Network, called DBEST. Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, then provide extensive ablation studies for each granularity to analyze our performance gain. We also demonstrate the effectiveness of our proposed method in synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% on the COCO-text and ICDAR2013 datasets for character-level evaluation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses text editing in natural scenes, in particular the challenge of replacing text in an image while preserving details such as color, font size, and background. It proposes a diffusion-based approach called **DBEST (Diffusion-BasEd Scene Text manipulation Network)** to improve on existing methods when editing text in complex scenes.

#### Main Contributions

1. **Identifying the limitations of existing diffusion models**: Current diffusion-based image editing methods perform poorly when editing text in natural scenes, especially at preserving details.
2. **Proposing a new diffusion model**: Using a technique similar to textual inversion, the paper proposes a diffusion model suited to scene text editing. The model generates images without modifying the original LDM, which makes it easier to integrate with other text-related pipelines.
3. **Synthetic dataset**: As part of the training strategy, a synthetic dataset is introduced to avoid the cost of collecting real scene text data.
4. **One-shot style adaptation**: A one-shot style adaptation strategy is proposed to maintain the source style during editing.
5. **Text recognition guidance**: Using a text recognition model for classifier guidance significantly improves results both qualitatively and quantitatively. On the SynText dataset, the proposed network achieved an OCR word accuracy of 84.83%, a 13% improvement over the best existing method.

#### Experimental Results

- On the COCO-Text and ICDAR2013 datasets, the method achieved character-level evaluation accuracies of 94.15% and 98.12%, respectively.
- Compared to existing methods, the approach performs better on image quality (PSNR, SSIM, LPIPS) and OCR accuracy.
- A series of ablation experiments validated the effectiveness of each component, particularly the importance of the synthetic dataset and text recognition guidance.

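
The summary quotes both word-level and character-level OCR accuracy without defining them. The sketch below shows one common way such metrics are computed: word accuracy as the fraction of exact matches, and character accuracy as one minus the total edit distance divided by the total number of target characters. These definitions, the function names, and the sample strings are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
def word_accuracy(predictions, targets):
    """Fraction of recognized words that match the target text exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(predictions, targets):
    """Assumed definition: 1 - (total edit distance / total target characters)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    total_chars = sum(len(t) for t in targets)
    if total_chars == 0:
        return 0.0
    errors = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    return max(0.0, 1.0 - errors / total_chars)


if __name__ == "__main__":
    # Hypothetical OCR output read from edited images vs. the requested text.
    recognized = ["hello", "wor1d", "shop"]
    requested = ["hello", "world", "stop"]
    print(f"word accuracy: {word_accuracy(recognized, requested):.4f}")
    print(f"char accuracy: {char_accuracy(recognized, requested):.4f}")
```

A single wrong character makes the whole word count as an error under the word-level metric, which is why character-level scores tend to be noticeably higher than word-level ones.
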
Through these contributions, the paper demonstrates the superior performance of DBEST in natural scene text editing and provides new directions for research in this field.
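
To illustrate the idea behind text recognition guidance, here is a minimal sketch of the standard classifier-guidance update, in which the predicted noise is shifted by the gradient of a classifier's log-likelihood for the target label. The toy linear "recognizer", the two-dimensional latent, and all names and values below are assumptions for illustration only; this summary does not specify the recognizer, the guidance scale, or the exact update used in DBEST.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def recognizer_log_prob(x, weights, target):
    """Toy recognizer (assumption): linear layer + softmax over character classes."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return log_softmax(logits)[target]


def recognizer_grad(x, weights, target):
    """Analytic gradient of log p(target | x) with respect to x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    probs = [math.exp(v) for v in log_softmax(logits)]
    grad = [0.0] * len(x)
    for k, row in enumerate(weights):
        coeff = (1.0 if k == target else 0.0) - probs[k]
        for i, w in enumerate(row):
            grad[i] += coeff * w
    return grad


def guided_noise(eps, x, weights, target, alpha_bar, scale):
    """Classifier guidance: eps_hat = eps - scale * sqrt(1 - alpha_bar) * grad."""
    grad = recognizer_grad(x, weights, target)
    factor = scale * math.sqrt(1.0 - alpha_bar)
    return [e - factor * g for e, g in zip(eps, grad)]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes, 2-D latent
    x_t = [0.2, -0.1]    # hypothetical noisy latent at some timestep
    eps = [0.5, -0.5]    # hypothetical noise predicted by the diffusion model
    print(guided_noise(eps, x_t, weights, target=1, alpha_bar=0.9, scale=2.0))
```

In a real pipeline the gradient would come from backpropagating through a text recognition network rather than from a closed-form expression, and the target would be a character sequence rather than a single class.
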