Abstract: Scene text detection has made great progress recently with the wide use of pre-training. Nonetheless, existing scene text detection methods still suffer from two problems: 1) limited annotated real data reduces feature robustness; 2) detectors perform poorly on text lacking visual information. In this paper, we explore the potential of the CLIP model and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improves linguistic reasoning about occluded text. Unlike previous random pixel-level masking methods, MTM performs a targeted, text-aware masking process in an unsupervised manner. Specifically, MTM consists of two steps: text perception and masked text modeling. In the text perception step, benefiting from the text-friendliness of CLIP, a Text Perception Module is proposed to attend to text areas by computing the similarity between the text and image tokens from the CLIP model. In the masked text modeling step, a Text-aware Masking Strategy is designed to mask the text areas, and the Masked Text Modeling Module is used to reconstruct the masked text. Through this reconstruction, MTM learns to reason about the linguistic information of masked text. The robust feature extraction learned by MTM ensures a more discriminative representation for text lacking visual information. Moreover, a new text dataset named OcclusionText is proposed to evaluate the robustness of detection methods to text occlusion. Extensive experiments on public benchmarks demonstrate that MTM can boost the performance of existing text detectors.
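The two steps described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names `cosine` and `text_aware_mask`, the `mask_ratio` parameter, and the top-k selection rule are all assumptions made for illustration, and the short lists of floats stand in for CLIP image-patch and text tokens. Each patch is scored by cosine similarity to a text token (text perception), and the highest-scoring patches are then marked for masking (text-aware masking).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_aware_mask(patch_tokens, text_token, mask_ratio=0.5):
    """Mark the patches most similar to the text token for masking.

    Hypothetical helper: the abstract does not say how the masked area is
    chosen from the similarity scores, so a simple top-k rule is assumed.
    """
    # Text perception: score every image patch against the text token.
    scores = [cosine(p, text_token) for p in patch_tokens]
    # Text-aware masking: pick the k highest-scoring patches.
    k = max(1, round(mask_ratio * len(patch_tokens)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [False] * len(patch_tokens)
    for i in top:
        mask[i] = True
    return mask, scores


# Toy data: four 2-D "patch tokens" and one "text token".
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
mask, scores = text_aware_mask(patches, text, mask_ratio=0.5)
print(mask)
```

In a real pipeline the masked patches would be handed to a reconstruction module; that part is omitted here because the abstract gives no detail about its architecture or loss.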

Self-Training for Domain Adaptive Scene Text Detection

Domain Adaptive Scene Text Detection via Subcategorization

Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection

Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

WeText: Scene Text Detection under Weak Supervision

Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition

Hard-aware Instance Adaptive Self-training for Unsupervised Cross-domain Semantic Segmentation

Self-supervised Scene Text Segmentation with Object-centric Layered Representations Augmented by Text Regions

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Accurate Scene Text Detection Via Scale-Aware Data Augmentation and Shape Similarity Constraint

Stratified Domain Adaptation: A Progressive Self-Training Approach for Scene Text Recognition

Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

Texts As Lines: Text Detection with Weak Supervision

Instance Adaptive Self-training for Unsupervised Domain Adaptation

Chinese Text Detection Using Deep Learning Model And Synthetic Data

A Unified Deep Neural Network For Scene Text Detection

Masked Retraining Teacher-Student Framework for Domain Adaptive Object Detection

Tracking Based Semi-Automatic Annotation for Scene Text Videos

Turning a CLIP Model into a Scene Text Detector

Text Recognition in Real Scenarios with a Few Labeled Samples

Domain Adaptation Curriculum Learning for Scene Text Detection in Inclement Weather Conditions