Abstract: In this work, we present DeepEraser, an effective deep network for generic text removal. DeepEraser utilizes a recurrent architecture that erases the text in an image via iterative operations. Our idea comes from the process of erasing pencil script, where the text area designated for removal is subject to continuous monitoring and the text is attenuated progressively, ensuring a thorough and clean erasure. Technically, at each iteration, an innovative erasing module is deployed, which not only explicitly aggregates the previous erasing progress but also mines additional semantic context to erase the target text. Through iterative refinements, the text regions are progressively replaced with more appropriate content and finally converge to a relatively accurate status. Furthermore, a custom mask generation strategy is introduced to improve the capability of DeepEraser for adaptive text removal, as opposed to indiscriminately removing all the text in an image. Our DeepEraser is notably compact with only 1.4M parameters and trained in an end-to-end manner. To verify its effectiveness, extensive experiments are conducted on several prevalent benchmarks, including SCUT-Syn, SCUT-EnsText, and Oxford Synthetic text dataset. The quantitative and qualitative results demonstrate the effectiveness of our DeepEraser over the state-of-the-art methods, as well as its strong generalization ability in custom mask text removal. The codes and pre-trained models are available at
### What problem does this paper attempt to address?
This paper addresses text removal in digital images: erasing text from an image and filling the erased regions with content that blends with the surrounding background. The technique matters for data privacy, where it can hide sensitive information such as addresses, license plate numbers, and ID numbers. It also supports applications such as intelligent education, text editing, image retrieval, and augmented reality translation.
### Main contributions of the paper
1. **Propose DeepEraser**: an end-to-end deep network for generic text removal. DeepEraser adopts a recurrent architecture that removes text through iterative context mining and context updating.
2. **Introduce a custom mask generation strategy**: this enables adaptive removal of selected text, rather than indiscriminately removing all text in the image.
3. **Lightweight design**: DeepEraser has only 1.4M parameters, and the training objective is simple: the L1 distance between the predicted text-free image and the ground-truth image.
4. **Extensive experimental verification**: Experiments are conducted on several prevalent benchmarks (SCUT-Syn, SCUT-EnsText, and the Oxford Synthetic text dataset). Quantitative and qualitative results show that DeepEraser outperforms existing methods and generalizes well to custom-mask text removal.
### Technical details
1. **Custom mask generation**:
- **Training phase**: Randomly select text instances in the image and generate a binary mask \( M_0 \), indicating the text areas to be removed.
- **Inference phase**: The mask can come from an existing text detector or from manually marked text areas.
2. **Feature extraction**:
- Concatenate the text image \( I_0 \) and the mask \( M_0 \) along the channel dimension and feed the result into a CNN-based backbone network for feature extraction.
- The backbone network consists of six residual blocks without any down-sampling operations, which preserves spatial resolution and yields a finer-grained feature map.
3. **Iterative text removal**:
- The core component is the erasing module, which iteratively refines the current removal results.
- In the \( k \)-th iteration, the erasing module receives the context feature \( E_I \), the previously estimated text-free image \( I_{k-1} \), and the latent feature \( l_{k-1} \), and outputs the updated latent feature \( l_k \) and the current residual image \( r_k \).
- The update formula is: \[ I_k = I_0 + r_k \]
4. **Training objective**:
- The loss is the weighted sum of the L1 distances between the predicted text-free image and the ground-truth image over all iterations: \[ L = \sum_{k=1}^{K} \lambda^{K-k} \| I_{gt} - I_k \|_1 \]
- Here \( K \) is the number of iterations and \( \lambda \) is a weight factor less than 1, so later iterations receive larger weights.
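The steps above can be sketched end to end in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `sample_mask`, `toy_erasing_module`, `erase`, and `sequence_l1_loss` are hypothetical names, the box format, `keep_prob`, and the default `lam` are assumed values, and the learned backbone and erasing module are replaced by a hand-written stand-in that simply pulls masked pixels toward the mean background colour. Only the control flow (mask sampling, the update \( I_k = I_0 + r_k \), and the weighted L1 sum) follows the description.

```python
import numpy as np

def sample_mask(instance_boxes, height, width, rng, keep_prob=0.5):
    """Build a binary mask M_0 from a random subset of text instances.

    instance_boxes: list of (x0, y0, x1, y1) boxes (assumed annotation format).
    keep_prob: hypothetical probability of selecting each instance for removal.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for x0, y0, x1, y1 in instance_boxes:
        if rng.random() < keep_prob:
            mask[y0:y1, x0:x1] = 1.0
    return mask

def toy_erasing_module(context, prev_image, latent):
    """Stand-in for the learned erasing module (NOT the paper's network).

    Returns the updated latent l_k and a residual image r_k.
    """
    image0, mask = context[..., :3], context[..., 3:]
    n_bg = max(float((1.0 - mask).sum()), 1.0)
    background = (image0 * (1.0 - mask)).sum(axis=(0, 1)) / n_bg  # mean unmasked colour
    latent = 0.5 * latent + 0.5 * background   # l_k aggregates previous progress
    target = 0.5 * prev_image + 0.5 * latent   # refine the previous estimate I_{k-1}
    residual = mask * (target - image0)        # r_k is zero outside the mask
    return latent, residual

def erase(image0, mask, num_iters=8):
    """Run K iterations and return every intermediate estimate I_1..I_K."""
    # The concatenation of I_0 and M_0 stands in for the context feature E_I.
    context = np.concatenate([image0, mask[..., None]], axis=-1)
    latent = np.zeros(3, dtype=np.float32)
    image_k = image0
    outputs = []
    for _ in range(num_iters):
        latent, residual = toy_erasing_module(context, image_k, latent)
        image_k = image0 + residual            # I_k = I_0 + r_k
        outputs.append(image_k)
    return outputs

def sequence_l1_loss(outputs, target, lam=0.85):
    """L = sum_k lam^(K-k) * ||I_gt - I_k||_1; lam=0.85 is an assumed default."""
    K = len(outputs)
    return float(sum(lam ** (K - k) * np.abs(target - out).sum()
                     for k, out in enumerate(outputs, start=1)))
```

Because the exponent \( K - k \) shrinks as \( k \) grows, the final estimate \( I_K \) gets weight 1 and earlier estimates are discounted geometrically. Practical implementations often average the absolute differences over pixels instead of summing them, which only rescales the loss.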
### Experimental results
- **Datasets**: SCUT-EnsText, SCUT-Syn, and the Oxford Synthetic text dataset.
- **Evaluation metrics**: PSNR, MSSIM, MSE, AGE, pEPs, and pCEPs.
- **Quantitative and qualitative results**: DeepEraser compares favorably with state-of-the-art methods on these datasets and generalizes well to custom-mask text removal.
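For readers unfamiliar with the metrics, three of them can be sketched as follows. This is an illustrative sketch rather than the paper's evaluation code: the function names are hypothetical, the 0 to 255 pixel range is assumed, and the `threshold` of 20 used for pEPs (percentage of error pixels) is an assumed value that may differ from the benchmark protocol.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error over all pixels and channels."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(pred, target)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / err))

def peps(pred, target, threshold=20.0):
    """Fraction of pixels whose absolute error exceeds `threshold` (assumed value)."""
    diff = np.abs(pred.astype(np.float64) - target.astype(np.float64))
    if diff.ndim == 3:               # average over colour channels first
        diff = diff.mean(axis=-1)
    return float(np.mean(diff > threshold))
```

Higher PSNR is better, while lower MSE and pEPs are better.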
### Conclusion
DeepEraser addresses text removal in digital images with a recurrent architecture and iterative context mining. Its lightweight design and simple training objective make it efficient and practical to deploy.