Abstract: Existing scene text removal (STR) task suffers from insufficient training data due to the expensive pixel-level labeling. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding box). Different from previous pretraining methods that use indirect auxiliary tasks only to enhance the implicit feature extraction ability, our TMIM first enables the STR task to be directly trained in a weakly supervised manner, which explores the STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance only with the public text detection datasets, which greatly alleviates the limitation of the high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
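
To make the role of text detection labels more concrete, the following is a minimal sketch of how bounding boxes could be turned into the two kinds of region the abstract mentions: a text region (where pseudo STR labels would apply) and masked non-text patches (which a background-modeling stream would recover). The patch-wise random masking, the patch size, the masking ratio, and the function names `text_mask_from_boxes` and `sample_background_mask` are all illustrative assumptions, not the procedure from the paper.

```python
import numpy as np


def text_mask_from_boxes(height, width, boxes):
    """Rasterize axis-aligned boxes (x0, y0, x1, y1) into a boolean text mask."""
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        # Clip to the image bounds so out-of-range labels do not wrap around.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        mask[y0:y1, x0:x1] = True
    return mask


def sample_background_mask(text_mask, patch=16, ratio=0.5, seed=0):
    """Randomly mask patches that contain no text pixels.

    Hypothetical masking scheme: the patch size and ratio are assumptions.
    """
    rng = np.random.default_rng(seed)
    height, width = text_mask.shape
    bg_mask = np.zeros_like(text_mask)
    for y in range(0, height, patch):
        for x in range(0, width, patch):
            if text_mask[y:y + patch, x:x + patch].any():
                continue  # leave any patch that touches text unmasked here
            if rng.random() < ratio:
                bg_mask[y:y + patch, x:x + patch] = True
    return bg_mask


# One 32x16 text box inside a 64x64 image.
text_mask = text_mask_from_boxes(64, 64, [(8, 8, 40, 24)])
bg_mask = sample_background_mask(text_mask)
```

By construction the two masks never overlap, so a reconstruction loss on `bg_mask` only ever sees true background pixels, while `text_mask` marks the pixels for which no ground truth exists and pseudo labels would be needed.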

What problem does this paper attempt to address?

This paper addresses **the shortage of training data for the Scene Text Removal (STR) task, which is caused by expensive pixel-level annotation**.

### Detailed Explanation

1. **Problem Background**:
   - Scene Text Removal (STR) is important in many practical applications, such as privacy protection and image editing.
   - Existing STR methods usually rely on pixel-level annotated data for supervised learning. These annotations are expensive and time-consuming (5 to 10 minutes per image), so existing STR datasets are relatively small (for example, 2.7k vs 111k for the STD dataset), which limits the potential of advanced STR models.
2. **Limitations of Existing Solutions**:
   - Some works train on synthetic datasets, but the domain gap between synthetic and real data limits performance.
   - Other methods pre-train on additional data. However, they usually design indirect pre-training tasks that learn feature representations only implicitly. Because these tasks are not specific to STR, the pre-training is redundant and inefficient.
3. **The Method Proposed in the Paper**:
   - The paper proposes a weakly supervised pre-training framework named **Text-aware Masked Image Modeling (TMIM)**, which uses the text detection labels in Scene Text Detection (STD) datasets directly for STR training.
   - TMIM contains two parallel streams:
     - **Background Modeling (BM) stream**: learns to generate non-text background content and provides pseudo STR labels.
     - **Text Erasing (TE) stream**: learns from the pseudo labels to acquire end-to-end STR capability.
   - Through the cooperation of these two streams, TMIM achieves impressive performance using only public text detection datasets, greatly alleviating the limitation of high-cost STR labels.
4. **Experimental Results**:
   - Experiments show that TMIM outperforms other pre-training methods and achieves state-of-the-art performance on the SCUT-EnsText dataset (37.35 PSNR).
   - Using only the STD dataset, the pre-trained model also achieves strong STR performance (36.62 PSNR), exceeding all compared fully supervised, non-fine-tuned models.

### Summary

By introducing the TMIM framework, this paper addresses the shortage of training data in the STR task, significantly improves model performance, and offers a new perspective on STR training with low annotation cost.
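
The results above are reported in PSNR (peak signal-to-noise ratio), defined as 10 · log10(MAX² / MSE), where higher values mean the output is closer to the ground-truth text-free image. Below is a minimal sketch of the standard formula, assuming 8-bit images with a peak value of 255. The exact evaluation protocol used in the paper (value range, color handling) is not stated here, so the `max_value` default is an assumption.

```python
import numpy as np


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(output, dtype=np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# A uniform error of 25.5 (10% of the 8-bit range) gives MAX^2 / MSE = 100.
reference = np.zeros((4, 4))
output = np.full((4, 4), 25.5)
score = psnr(reference, output)  # about 20 dB
```

Because the scale is logarithmic, the gap between the two reported scores (36.62 and 37.35) corresponds to a ratio in mean squared error rather than a fixed difference.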

Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition

Maskstr: Guide Scene Text Recognition Models with Masking

DiffSTR: Controlled Diffusion Models for Scene Text Removal

PERT: A Progressively Region-based Network for Scene Text Removal

PSSTRNet: Progressive Segmentation-guided Scene Text Removal Network

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

Masked and Permuted Implicit Context Learning for Scene Text Recognition

Scene Text Recognition with Self-supervised Contrastive Predictive Coding

Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing

Flexible scene text recognition based on dual attention mechanism

Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition

Towards Robust Scene Text Image Super-resolution via Explicit Location Enhancement

SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

One Model for Two Tasks: Cooperatively Recognizing and Recovering Low-Resolution Scene Text Images by Iterative Mutual Guidance

CLIP-Llama: A New Approach for Scene Text Recognition with a Pre-Trained Vision-Language Model and a Pre-Trained Language Model

What is the Real Need for Scene Text Removal? Exploring the Background Integrity and Erasure Exhaustivity Properties

FETNet: Feature Erasing and Transferring Network for Scene Text Removal