Abstract: In this paper, we propose TextDestroyer, the first training- and annotation-free method for scene text destruction using a pre-trained diffusion model. Existing scene text removal models require complex annotation and retraining, and may leave faint yet recognizable text information, compromising privacy protection and content concealment. TextDestroyer addresses these issues by employing a three-stage hierarchical process to obtain accurate text masks. Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction. During the diffusion denoising process, self-attention key and value are referenced from the original latent to restore the compromised background. Latent codes saved at each inversion step are used for replacement during reconstruction, ensuring perfect background restoration. The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.
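The scrambling and replacement steps named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the array shapes, the function names, and the `denoise_step` callable (a stand-in for one step of a pre-trained diffusion model) are all assumptions made for illustration, and the self-attention key/value referencing is omitted.

```python
import numpy as np

def scramble_text_latent(start_code, text_mask, rng):
    """Replace the text positions of the start code with fresh Gaussian noise.

    start_code: (C, H, W) latent obtained by inversion (assumed layout).
    text_mask:  (H, W) boolean array, True where text was located.
    """
    noise = rng.standard_normal(start_code.shape)
    # Broadcast the (H, W) mask over the channel axis.
    return np.where(text_mask[None, :, :], noise, start_code)

def restore_background(latent, saved_latent, text_mask):
    """Overwrite non-text positions with the latent saved during inversion."""
    return np.where(text_mask[None, :, :], latent, saved_latent)

def reconstruct(saved_latents, text_mask, denoise_step, rng):
    """Denoise from a scrambled start code, restoring the background each step.

    saved_latents[t] is the latent stored after t inversion steps, so
    saved_latents[0] encodes the original image and saved_latents[-1] is the
    start code. denoise_step(latent, t) is a hypothetical stand-in for one
    denoising step of a diffusion model.
    """
    last = len(saved_latents) - 1
    latent = scramble_text_latent(saved_latents[last], text_mask, rng)
    for t in range(last, 0, -1):
        latent = denoise_step(latent, t)
        latent = restore_background(latent, saved_latents[t - 1], text_mask)
    return latent
```

With this layout, every non-text position of the result equals the corresponding position of `saved_latents[0]`, which is one way to read the abstract's claim that the background is restored exactly; only the masked text region is regenerated from noise.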

What problem does this paper attempt to address?

### What problem does this paper attempt to solve?

This paper addresses scene text removal in images, and specifically how to destroy the text information in an image completely without additional training or annotation. Existing scene text removal models usually require complex annotation and retraining, and may leave blurry but still recognizable text traces, which undermines privacy protection and content hiding. This matters most when the text is sensitive information such as license plate numbers or telephone numbers.

#### Main contributions of the paper

1. **No training or annotation required**: According to the paper, TextDestroyer is the first method that can destroy scene text without additional training or annotation. This avoids cumbersome data annotation and resource-intensive model training, which makes the text removal process simpler and more practical.
2. **Enhanced text destruction and background restoration**: The method uses a three-stage hierarchical process to locate text accurately, so that the text can be destroyed completely while the background is restored well. Text areas are disrupted with Gaussian noise and the image is reconstructed during diffusion denoising, which keeps visual distortion low and preserves the quality of non-text areas.

#### Method overview

The core idea of TextDestroyer is to use a pre-trained diffusion model to destroy text without training or annotation. The steps are as follows:

1. **Hierarchical text localization**:
   - **Initial text capture**: Segment and capture the initial text area by aggregating multiple token-level attention maps.
   - **Continuous text adjustment**: Crop and enlarge each text area in the original image, and progressively refine the text area.
   - **Fine-grained text division**: Perform binary clustering on the original image to determine the exact text boundaries.
2. **Text area destruction**:
   - Replace the latent code of the text area with random Gaussian noise, so that the original text cannot be restored during denoising.
3. **Non-text area restoration**:
   - During denoising, guide the reconstruction through KV combination (key-value combination), in which self-attention keys and values are referenced from the original latent, and replace the distorted background latent codes with those saved during inversion to reduce background distortion.

#### Experimental results

The paper reports experiments on the SCUT-EnsText dataset and compares TextDestroyer with several existing methods (EraseNet, MTRNet, GaRNet, STRDD, DeepEraser, and CTRNet). The results show that TextDestroyer falls behind some specially trained models on some quantitative metrics, but performs well in qualitative evaluation, especially on generated images.

Overall, TextDestroyer is a training- and annotation-free approach that aims to destroy the text information in an image completely, and it is suited to applications such as privacy protection and content hiding.
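The fine-grained text division step in the method overview relies on binary clustering. The summary does not say which clustering algorithm or features the paper uses, so the sketch below assumes plain two-means clustering on pixel colors inside a crop, and picks as "text" the cluster that lies most completely inside the coarse mask. The function name `refine_text_mask` and that selection rule are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def _assign(pixels, centres):
    """Return, for each pixel, the index of the nearest of the two centres."""
    dists = np.linalg.norm(pixels[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)

def refine_text_mask(crop, coarse_mask, iters=20):
    """Refine a coarse text mask with two-means clustering on pixel colors.

    crop:        (H, W, C) image crop around a text region.
    coarse_mask: (H, W) boolean array from an earlier, rougher localization.
    Returns an (H, W) boolean mask of pixels judged to be text.
    """
    pixels = crop.reshape(-1, crop.shape[-1]).astype(np.float64)
    # Initialise the centres from the darkest and brightest pixel (assumed heuristic).
    brightness = pixels.sum(axis=1)
    centres = np.stack([pixels[brightness.argmin()], pixels[brightness.argmax()]])
    for _ in range(iters):
        labels = _assign(pixels, centres)
        for k in range(2):
            members = labels == k
            if members.any():
                centres[k] = pixels[members].mean(axis=0)
    labels = _assign(pixels, centres)

    inside = coarse_mask.reshape(-1)
    # Fraction of each cluster that falls inside the coarse mask.
    fractions = [
        (inside & (labels == k)).sum() / max((labels == k).sum(), 1)
        for k in range(2)
    ]
    text_cluster = int(np.argmax(fractions))
    fine = (labels == text_cluster) & inside
    return fine.reshape(coarse_mask.shape)
```

The selection rule reflects the idea that text pixels should sit almost entirely inside the coarse region, while background pixels spill outside it; a different rule (for example, choosing the smaller cluster) would be an equally plausible assumption.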

TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images
