Abstract: Scene text detection has made great progress recently with the wide use of pre-training. Nonetheless, existing scene text detection methods still suffer from two problems: 1) limited annotated real data reduces feature robustness; 2) detectors perform poorly on text lacking visual information. In this paper, we explore the potential of the CLIP model and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improves linguistic reasoning ability under text occlusion. Unlike previous random pixel-level masking methods, MTM performs a targeted, text-aware masking process in an unsupervised manner. Specifically, MTM consists of two steps: text perception and masked text modeling. In the text perception step, benefiting from the text-friendliness of CLIP, a Text Perception Module attends to text areas by computing the similarity between the text and image tokens from the CLIP model. In the masked text modeling step, a Text-aware Masking Strategy masks the text areas, and a Masked Text Modeling Module reconstructs the masked text. Through this reconstruction, MTM learns to reason about the linguistic information of the masked text. The robust feature extraction learned by MTM ensures a more discriminative representation for text lacking visual information. Moreover, a new text dataset named OcclusionText is proposed to evaluate the robustness of detection methods to text occlusion. Extensive experiments on public benchmarks demonstrate that MTM can boost the performance of existing text detectors.
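The two steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the image and text tokens are already available as plain arrays (no CLIP model is loaded), it masks at the token level rather than in image space, and the function names, the `mask_ratio` parameter, the zero mask value, and the mean-squared-error reconstruction loss are all hypothetical choices made for illustration.

```python
import numpy as np


def text_attention_map(image_tokens, text_token):
    """Cosine similarity between every image patch token and one text token.

    image_tokens: array of shape (num_patches, dim)
    text_token:   array of shape (dim,)
    Returns an array of shape (num_patches,) with values in [-1, 1].
    """
    img = image_tokens / np.linalg.norm(image_tokens, axis=1, keepdims=True)
    txt = text_token / np.linalg.norm(text_token)
    return img @ txt


def text_aware_mask(similarity, mask_ratio=0.3):
    """Boolean mask selecting the patches most similar to the text token.

    mask_ratio is an assumed hyperparameter: the fraction of patches to mask.
    """
    num_masked = max(1, int(round(mask_ratio * similarity.shape[0])))
    top = np.argsort(similarity)[-num_masked:]
    mask = np.zeros(similarity.shape[0], dtype=bool)
    mask[top] = True
    return mask


def apply_mask(image_tokens, mask, mask_value=0.0):
    """Return a copy of the tokens with the masked patches overwritten."""
    masked = image_tokens.copy()
    masked[mask] = mask_value
    return masked


def reconstruction_loss(predicted, target, mask):
    """Mean squared error over masked patches only (an assumed loss)."""
    diff = predicted[mask] - target[mask]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))   # stand-in for image patch tokens
    text = rng.normal(size=8)           # stand-in for a text token
    sim = text_attention_map(tokens, text)
    mask = text_aware_mask(sim, mask_ratio=0.25)
    masked_tokens = apply_mask(tokens, mask)
    # A real model would predict the masked content; zeros are a placeholder.
    print(mask.sum(), reconstruction_loss(np.zeros_like(tokens), tokens, mask))
```

Selecting the highest-similarity patches is what makes the masking targeted rather than random: the patches hidden from the model are the ones most likely to contain text, so reconstructing them requires using the surrounding context.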

Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection

Text-Attentional Convolutional Neural Network for Scene Text Detection

Text-Attentional Convolutional Neural Networks for Scene Text Detection

Combining Swin Transformer and Attention-Weighted Fusion for Scene Text Detection

A Multi-Level Feature Fusion Network for Scene Text Detection with Text Attention Mechanism

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Mask is All You Need: Rethinking Mask R-CNN for Dense and Arbitrary-Shaped Scene Text Detection

A Unified Deep Neural Network For Scene Text Detection

Improved CTPN Based Attention Mechanism for Scene Text Detection

Deep Neural Network with Attention Model for Scene Text Recognition

DPNet: Scene text detection based on dual perspective CNN-transformer

Mask Scene Text Recognizer

A Multi-Scale Natural Scene Text Detection Method Based on Attention Feature Extraction and Cascade Feature Fusion

Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

Aggregated Text Transformer for Scene Text Detection

Efficient Scene Text Detection with Textual Attention Tower

Scene Text Detection with Fully Convolutional Neural Networks

DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection

Using of Attention for Scene Text Detection

SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition

AB-LSTM: Attention-based Bidirectional LSTM Model for Scene Text Detection