Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

Zixiao Wang, Hongtao Xie, YuXin Wang, Yadong Qu, Fengjun Guo, Pengwei Liu
2024-09-20
Abstract: The scene text removal (STR) task suffers from insufficient training data because pixel-level labeling is expensive. In this paper, we aim to address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance implicit feature extraction, TMIM is the first to enable the STR task to be trained directly in a weakly supervised manner, exploring STR knowledge explicitly and efficiently. In TMIM, first, a Background Modeling stream is built to learn background generation rules by recovering the masked non-text region. Meanwhile, it provides pseudo STR labels on the masked text region. Second, a Text Erasing stream is proposed to learn from the pseudo labels and equip the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model can achieve impressive performance with only public text detection datasets, which greatly alleviates the limitation of high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper attempts to solve is **insufficient training data for the Scene Text Removal (STR) task, caused by expensive pixel-level annotation**.

### Detailed Explanation

1. **Problem Background**:
   - Scene Text Removal (STR) is important in many practical applications, such as privacy protection and image editing.
   - Existing STR methods usually rely on pixel-level annotated data for supervised learning. These annotations are expensive and time-consuming (5-10 minutes per image), so existing STR datasets are relatively small (for example, 2.7k versus 111k for the STD dataset), which limits the potential of advanced STR models.
2. **Limitations of Existing Solutions**:
   - Some works train on synthetic datasets, but the domain gap between synthetic and real data limits performance.
   - Other methods use additional data for pre-training. However, they usually design indirect pre-training tasks that learn feature representations implicitly and are not specific to the STR task, which makes them redundant and inefficient.
3. **The Method Proposed in the Paper**:
   - The paper proposes a weakly-supervised pre-training framework named **Text-aware Masked Image Modeling (TMIM)**, which directly uses the text detection labels in Scene Text Detection (STD) datasets for STR training.
   - TMIM contains two parallel streams:
     - **Background Modeling (BM) stream**: learns to generate non-text background content and provides pseudo STR labels.
     - **Text Erasing (TE) stream**: learns from the pseudo labels to achieve end-to-end STR capability.
   - Through the cooperation of these two streams, TMIM achieves impressive performance using only public text detection datasets, greatly alleviating the limitation of high-cost STR labels.
4. **Experimental Results**:
   - Experiments show that TMIM not only outperforms other pre-training methods but also achieves state-of-the-art performance on the SCUT-EnsText dataset (PSNR of 37.35).
   - Using only the STD dataset, the pre-trained model also achieves strong STR performance (PSNR of 36.62), exceeding all compared fully-supervised models that were not fine-tuned.

### Summary

By introducing the TMIM framework, this paper addresses the shortage of training data in the STR task, significantly improves model performance, and offers a new perspective on STR training with low annotation cost.
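
As a rough illustration of two ideas mentioned above, the following minimal sketch shows how text bounding boxes could be turned into a text-region mask, how a separate mask over text-free patches might be sampled for background recovery, and how PSNR is computed. It is not the authors' implementation: the function names, the patch size, the masking ratio, and the random patch sampling are all assumptions made for illustration, and images are plain nested lists of grayscale values to keep the example dependency-free.

```python
import math
import random


def box_mask(height, width, boxes):
    """Rasterize axis-aligned text boxes (x0, y0, x1, y1), end-exclusive, into a 0/1 mask."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask


def background_patch_mask(text_mask, patch, ratio, rng):
    """Hide a random fraction of text-free patches (hypothetical masking scheme)."""
    height, width = len(text_mask), len(text_mask[0])
    candidates = []
    for py in range(0, height, patch):
        for px in range(0, width, patch):
            has_text = any(
                text_mask[y][x]
                for y in range(py, min(py + patch, height))
                for x in range(px, min(px + patch, width))
            )
            if not has_text:
                candidates.append((py, px))
    chosen = rng.sample(candidates, int(len(candidates) * ratio))
    mask = [[0] * width for _ in range(height)]
    for py, px in chosen:
        for y in range(py, min(py + patch, height)):
            for x in range(px, min(px + patch, width)):
                mask[y][x] = 1
    return mask


def psnr(img_a, img_b, max_value=255.0):
    """PSNR = 10 * log10(max_value^2 / MSE) for two equally sized grayscale images."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    ]
    mse = sum(diffs) / len(diffs)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


# Toy 8x8 image with one text box covering columns 2-5 and rows 2-3.
text_mask = box_mask(8, 8, [(2, 2, 6, 4)])
bg_mask = background_patch_mask(text_mask, patch=2, ratio=0.5, rng=random.Random(0))
```

In this sketch, `bg_mask` marks only text-free pixels, which corresponds loosely to the non-text region a background-modeling step would be asked to recover, while `text_mask` marks the region where pseudo labels would be needed. The `psnr` helper is the standard definition and can score a model output against a reference image; higher values indicate a closer match.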