Abstract: Scene text detection has made great progress recently with the wide use of pre-training. Nonetheless, existing scene text detection methods still suffer from two problems: 1) Limited annotated real data reduces feature robustness. 2) Detectors perform poorly on text that lacks visual information. In this paper, we explore the potential of the CLIP model and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improves linguistic reasoning about occluded text. Unlike previous random pixel-level masking methods, MTM performs a targeted, text-aware masking process in an unsupervised manner. Specifically, MTM consists of two steps: text perception and masked text modeling. In the text perception step, benefiting from the text-friendliness of CLIP, a Text Perception Module is proposed to attend to text areas by computing the similarity between the text and image tokens from the CLIP model. In the masked text modeling step, a Text-aware Masking Strategy is designed to mask the text areas, and the Masked Text Modeling Module is used to reconstruct the masked text. Through this reconstruction, MTM learns to reason about the linguistic information of masked text. The robust feature extraction learned by MTM yields a more discriminative representation for text that lacks visual information. Moreover, a new text dataset named OcclusionText is proposed to evaluate the robustness of detection methods to text occlusion. Extensive experiments on public benchmarks demonstrate that MTM can boost the performance of existing text detectors.
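
The text perception and text-aware masking steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each image patch and the text prompt are already embedded as vectors (as CLIP would provide), uses cosine similarity as the similarity measure, and masks a fixed fraction of the most text-like patches. The names `cosine`, `text_aware_mask`, and `mask_ratio` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_aware_mask(patch_tokens, text_token, mask_ratio=0.25):
    """Score each patch against the text token and mask the top-scoring ones.

    patch_tokens: list of patch embeddings (assumed to come from an image encoder).
    text_token:   one text embedding (assumed to come from a text encoder).
    mask_ratio:   hypothetical fraction of patches to mask.
    Returns (similarities, mask), where mask[i] is True if patch i is masked.
    """
    sims = [cosine(p, text_token) for p in patch_tokens]
    k = max(1, round(mask_ratio * len(sims)))
    # Rank patches from most to least similar to the text token.
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    masked = set(ranked[:k])
    return sims, [i in masked for i in range(len(sims))]


# Toy 2-D embeddings: patches 0 and 2 point in nearly the same direction as the text token.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text = [1.0, 0.0]
sims, mask = text_aware_mask(patches, text, mask_ratio=0.5)
print(mask)
```

In this toy setting the two patches most aligned with the text embedding are selected for masking, in contrast to random masking, which would pick patches regardless of their similarity to the text. A reconstruction module would then be trained to recover the masked regions; that part is omitted here.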

Masked Visual-Textual Prediction for Document Image Representation Pretraining

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

SelfDoc: Self-Supervised Document Representation Learning

Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Masked Feature Prediction for Self-Supervised Visual Pre-Training

Maskstr: Guide Scene Text Recognition Models with Masking

Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Toward High Quality Facial Representation Learning

A Unified View of Masked Image Modeling

Self-supervised Pre-training of Text Recognizers

MaskViT: Masked Visual Pre-Training for Video Prediction

Document Image Layout Analysis via MASK Constraint

MAC: Masked Contrastive Pre-Training for Efficient Video-Text Retrieval

LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding