TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images

Mengcheng Li, Mingbao Lin, Fei Chao, Chia-Wen Lin, Rongrong Ji
2024-11-01
Abstract: In this paper, we propose TextDestroyer, the first training- and annotation-free method for scene text destruction using a pre-trained diffusion model. Existing scene text removal models require complex annotation and retraining, and may leave faint yet recognizable text information, compromising privacy protection and content concealment. TextDestroyer addresses these issues by employing a three-stage hierarchical process to obtain accurate text masks. Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction. During the diffusion denoising process, self-attention key and value are referenced from the original latent to restore the compromised background. Latent codes saved at each inversion step are used for replacement during reconstruction, ensuring perfect background restoration. The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses scene text removal in images, specifically how to destroy the text information in an image completely without additional training or annotation. Existing scene text removal models usually require complex annotation and retraining, and they may leave blurry but still recognizable text traces, which undermines privacy protection and content concealment. This matters most when the text is sensitive, such as license plate numbers or telephone numbers.

#### Main contributions of the paper

1. **No training or annotation required**: TextDestroyer is the first method that can destroy scene text without additional training or annotation. This simplifies the text removal process by avoiding labor-intensive data annotation and resource-intensive model training.
2. **Enhanced text destruction and background restoration**: The method uses a three-stage hierarchical process to obtain accurate text masks, which supports both thorough text destruction and better background restoration. It scrambles the text area with Gaussian noise and then reconstructs the image during diffusion denoising, which keeps visual distortion low and preserves the quality of non-text areas.

#### Method overview

The core idea of TextDestroyer is to use a pre-trained diffusion model to destroy text without training or annotation. The steps are as follows:

1. **Hierarchical text localization**:
   - **Initial text capture**: Segment and capture the initial text area by aggregating multiple token-level attention maps.
   - **Continuous text adjustment**: Crop and enlarge each text area in the original image, and progressively refine the text region.
   - **Fine-grained text division**: Perform binary clustering on the original image to determine the exact text boundaries.
2. **Text area destruction**:
   - Replace the latent code of the text area with random Gaussian noise so that the original text cannot be restored during denoising.
3. **Non-text area restoration**:
   - During denoising, guide the reconstruction through KV combination (self-attention keys and values referenced from the original latent), and replace erroneous latent codes with those saved during inversion to reduce background distortion.

#### Experimental results

The paper reports experiments on the SCUT-EnsText dataset and compares TextDestroyer with several existing methods (EraseNet, MTRNet, GaRNet, STRDD, DeepEraser, and CTRNet). The results show that although TextDestroyer trails some purpose-trained models on certain quantitative metrics, it performs well in qualitative evaluation, especially on generated images.

Overall, TextDestroyer offers a novel and efficient way to destroy the text information in images completely without additional training or annotation, and it suits application scenarios such as privacy protection and content concealment.
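
The text area destruction and non-text area restoration steps from the method overview can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `scramble_text_latents` and `restore_background`, the single-channel list-of-rows latent, and the constant drift used as a stand-in for a denoising step are all assumptions made for illustration. A real pipeline would operate on multi-channel latent tensors and would use a separate saved latent for each inversion step.

```python
import random

# Toy, single-channel "latent" stored as a list of rows. Real diffusion latents
# are multi-channel tensors; this layout is an assumption made for brevity.

def scramble_text_latents(latent, text_mask, rng):
    """Return a copy of `latent` with text positions replaced by N(0, 1) noise."""
    return [
        [rng.gauss(0.0, 1.0) if is_text else value
         for value, is_text in zip(latent_row, mask_row)]
        for latent_row, mask_row in zip(latent, text_mask)
    ]

def restore_background(current, saved, text_mask):
    """Keep `current` inside the text mask; copy `saved` everywhere else."""
    return [
        [cur if is_text else ref
         for cur, ref, is_text in zip(cur_row, ref_row, mask_row)]
        for cur_row, ref_row, mask_row in zip(current, saved, text_mask)
    ]

rng = random.Random(0)
saved_latent = [[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]]      # stand-in for a latent saved during inversion
text_mask = [[False, True, True],
             [False, False, True]]    # True marks text positions

# Text area destruction: only masked positions receive fresh Gaussian noise.
start_code = scramble_text_latents(saved_latent, text_mask, rng)

# Stand-in for one denoising step that perturbs every position slightly.
drifted = [[value + 0.05 for value in row] for row in start_code]

# Non-text area restoration: background positions are reset to the saved latent.
restored = restore_background(drifted, saved_latent, text_mask)
```

After the last call, every background position in `restored` equals the corresponding value in `saved_latent`, while the text positions keep whatever the denoising stand-in produced from the noise. The sketch leaves out the KV combination step, which would require access to the self-attention layers of the diffusion model.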