Masked Text Modeling: A Self-Supervised Pre-training Method for Scene Text Detection

Keran Wang, Hongtao Xie, Yuxin Wang, Dongming Zhang, Yadong Qu, Zuan Gao, Yongdong Zhang
DOI: https://doi.org/10.1145/3581783.3612370
2023-01-01
Abstract: Scene text detection has made great progress recently with the wide use of pre-training. Nonetheless, existing scene text detection methods still suffer from two problems: 1) limited annotated real data reduces feature robustness; 2) detectors perform poorly on text that lacks visual information. In this paper, we explore the potential of the CLIP model and propose a novel self-supervised Masked Text Modeling (MTM) pre-training method for scene text detection, which can be trained with unlabeled data and improves linguistic reasoning ability under text occlusion. Unlike previous random pixel-level masking methods, MTM performs a targeted, text-aware masking process in an unsupervised manner. Specifically, MTM consists of two steps: text perception and masked text modeling. In the text perception step, benefiting from the text-friendliness of CLIP, a Text Perception Module is proposed to attend to text areas by computing the similarity between the text and image tokens from the CLIP model. In the masked text modeling step, a Text-aware Masking Strategy is designed to mask the text areas, and a Masked Text Modeling Module is used to reconstruct the masked text. Through this reconstruction, MTM learns to reason about the linguistic information of masked text. The robust feature extraction learned by MTM yields a more discriminative representation for text that lacks visual information. Moreover, a new text dataset named OcclusionText is proposed to evaluate the robustness of detection methods to text occlusion. Extensive experiments on public benchmarks demonstrate that MTM can boost the performance of existing text detectors.
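The text perception and text-aware masking steps described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: small hand-written vectors stand in for CLIP image-patch and text tokens, cosine similarity is used as the similarity measure, and the hypothetical `text_aware_mask` function with its `mask_ratio` parameter simply masks the patches most similar to the text token.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def text_aware_mask(patch_tokens, text_token, mask_ratio=0.5):
    """Score each patch against the text token and mask the most text-like ones.

    Hypothetical helper: the top-k selection rule and mask_ratio are
    illustrative assumptions, not the paper's actual masking strategy.
    Returns (scores, mask), where mask[i] is True if patch i is masked.
    """
    scores = [cosine(p, text_token) for p in patch_tokens]
    k = max(1, round(mask_ratio * len(patch_tokens)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return scores, [i in chosen for i in range(len(scores))]


# Toy stand-ins for CLIP tokens: patches 0 and 1 point roughly the same way as the text token.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
scores, mask = text_aware_mask(patches, text, mask_ratio=0.5)
print(mask)
```

In a real pipeline the masked patches would then be passed to a reconstruction module; that part is omitted here because the abstract does not specify its form.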